-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rustdoc: Poor CPU usage on Windows during thousands of doctests #63638
Comments
I've explored replacing the in-proceess rustc functionality for doctests with writing out doctests and spawning rustc (though not fully fleshed out, e.g. linker arguments and such are not passed down). With my patch, and cpu-usage-over-time reports ~10% idle; libcore doctests finish in 80 seconds. Without, cpu-usage reports ~40% idle, and libcore doctests complete in 140 seconds. This seems like a viable strategy (nearly 2x improvement in runtime on 8 core/16 thread machine). I can work on fleshing out the patch if we think this is a viable approach, initial work is up here: Mark-Simulacrum@d8fee19. I also tried to do a little investigating into where the slowdown is, but wasn't able to conclude much. It does seem like we're seeing a lot of threads blocked on acquiring jobserver tokens, and there are seemingly dozens of threads being spawned per second, and there seems to be an abnormally high |
something that I noticed when I was looing in to this two years ago was that if I decreased the RUST_TEST_THREADS to 6 for my 32 thread machine I got a 2x runtime improvement. at that time (before measureme) I created som chrome trace files for this, links to the log files can be found on the https://andjo403.github.io/rs_tracing/ page (I can only open the log files in chrome). but maybe is better to make new traces with measureme if it is possible to do in rustdoc or maybe this type of traces do not giva any actionable information. |
created some new traces with hacking the rustc self-profile flag to get it working in rustdoc you can find the traces in https://andjo403.github.io/rustdoc_selfprofile/ the out of process looks super nice I got way more than 2x (5,4s to 0,7s) in my small tests with my 32 threads cpu. and 100% cpu usage. with the out of process fix the part of the compilation that varies the most depending on how many concurrent test that is running is the "running_linker" part. also tried to add with the out of process solution shall rustdoc start a jobbserver as it before was only one process it was ok the session created an jobbserver with 32 tokens but now that will be done for each test (rustc call). |
Nice work @Mark-Simulacrum and thanks for helping to confirm @andjo403! Sounds like an out-of-process solution would fix the lion's share of issues here perhaps, and so long as it jumps back up to close to 100% cpu usage I think that's safe to say it'd close this issue! It's a good point that the jobserver and/or limiting parallelism isn't done very well today. I think though that clamping the codegen units to 1 is probably a good idea since doctests are almost always very small and don't really benefit from parallelism anyway. Additionally clamping to one codegen unit should avoid the need to deal with the jobserver for now since parallelism only comes from codegen units. Eventually with a parallel |
I've marked this issue as E-mentor since I think I can provide sufficient guidance for someone interested in solving this (I probably won't get around to it for some time). What we want to do here is take the patch in Mark-Simulacrum@d8fee19 and finish threading through all of the options to rustc via command line arguments. There's quite a few, but I largely suspect this should be a long but relatively simple task. We'll probably discover a regression here or there once we land this for esoteric configurations where we don't quite match up with argument parsing and what we were passing in directly, but we can deal with that as they arise I suspect. Once the options are threaded through I think that patch will be mostly done -- we'll want to perhaps land the codegen units change in a separate patch. One outstanding question that will need some thought is identifying the rustc binary to run; right now the patch looks at Happy to provide more guidance here if it'd be helpful! @andjo403 Would you perhaps be interested in taking this patch on? If not, no worries! |
I can to try implement this. |
I would start off by just interpreting the session parameters (they should be close to 1-1 and mostly lossless); I think there's possibly potential for cleanup but I could see the config::options being more complicated to re-stringify into rustc args. Whatever works best though when you get into the weeds is fine by me though! |
@alexcrichton @Mark-Simulacrum some strange things that I found was the debugging options (-Z ) was not used in the generation of the executable, so have not added that now and the rustdoc linker option overwrites the linker codegen option (-C linker) so still do that. |
and also maybe needs to print the stdout and stderr in some other way then |
now all the test in src/test/rustdoc passes for andjo403@2661f79 |
Could you post a PR with your changes? That'll make it easier to keep feedback central and such. |
Out-of-process rustdoc would also help a lot with rust-lang/miri#584. Would be great if a design could be picked that lets the environment override the compiler used for building the test, so that we can run them in Miri instead. :) |
rustdoc use -Ccodegen-units=1 by default for test compile as the test is small we do not want split up in multiple codegen units and also as there is multiple test running at the same time this will reduce the number of concurrent threads tested the test time with `./x.py test src/libcore --doc` for my 16 core 32 thread cpu i get about 6% faster execution and my 2 core 4 thread cpu I get about 10% faster execution cc #63638 r? @Mark-Simulacrum
…lacrum rustdoc use -Ccodegen-units=1 by default for test compile as the test is small we do not want split up in multiple codegen units and also as there is multiple test running at the same time this will reduce the number of concurrent threads tested the test time with `./x.py test src/libcore --doc` for my 16 core 32 thread cpu i get about 6% faster execution and my 2 core 4 thread cpu I get about 10% faster execution cc rust-lang#63638 r? @Mark-Simulacrum
maybe shall open this again as the windows version still do not use 100% cpu and execution time is the same or a small regression. |
…lacrum rustdoc use -Ccodegen-units=1 by default for test compile as the test is small we do not want split up in multiple codegen units and also as there is multiple test running at the same time this will reduce the number of concurrent threads tested the test time with `./x.py test src/libcore --doc` for my 16 core 32 thread cpu i get about 6% faster execution and my 2 core 4 thread cpu I get about 10% faster execution cc rust-lang#63638 r? @Mark-Simulacrum
I suspect that windows is blocked on the emitter lock, but don't know. I don't personally think it's too worthwhile to reopen this unless we can get similar stats as with Linux (e.g., is it using 90%+ on Windows? Far less?) |
when I use the Something that I was thinking of is that it creates a new dir for every test is that not slow to do that on windows? maybe change the out files to have other names instead of using dirs. |
Interesting. We can reopen in that case, though I wouldn't personally know where to or how to look at performance bottlenecks on Windows. |
When running the doctests for the
core
crate on my 28-core system the system unfortunately never really gets above 40% cpu usage, yet the doctests being compiled are embarassingly parallel and should result in full usage of the cores available to execute.For now this can be reproduced with
./x.py test src/libcore --doc
and watching CPU usage viapython src/ci/cpu-usage-over-time.py
One suspicion I have for this is that rustdoc has always use in-process compilation for all the doctests, but there may be an excessive amount of synchronization in LLVM or rustc which prevents all CPU cores from being used. I'm not sure if this is the cause though.
The text was updated successfully, but these errors were encountered: