rhel8 arm64 throws NullReferenceExceptions #43349
Comments
@tmds you can run the test under lldb. The debugger should break right at the place where the null reference happened. Then you can use SOS commands (provided you have SOS installed, see https://github.com/dotnet/diagnostics/blob/master/documentation/installing-sos-instructions.md) to disassemble the managed method, view the call stack including managed frames and their locals and arguments (if they are available in stack slots and current registers), dump managed objects, etc. Essential SOS commands:
Documentation for SOS commands supported on Unix can be found at https://github.com/dotnet/diagnostics/blob/c7bc44208fd1c10abc6d4258eb29de0906d2a22e/src/SOS/Strike/sosdocsunix.txt
Are these exceptions happening in specific tests in a reproducible manner, or do they seem to be random, hitting different tests each run? |
I also wonder: are you referring to dotnet/runtime CI or to Red Hat's internal CI? |
If it is our CI, I can definitely take a look myself. |
It's on Red Hat internal CI. I don't think dotnet/runtime CI includes rhel8 arm64? The results differ on each run. In the last two runs
|
This didn't work. |
It seems it might be related to something with capturing / restoring context around GC suspension, the FlushProcessWriteBuffers not working or something of that kind. |
It is supported:
I have trouble getting this to reproduce with the debugger. Is there a patch I can make that will call |
The easiest way is to put abort into the sigsegv_handler here:
Please note that would mean that even NullReferenceExceptions that would otherwise be handled would abort and I guess we have tests that catch these. |
I've made the change and eliminated some tests so they pass on my |
CI produced two coredumps. lldb doesn't like them:
printing native stacks using gdb:
managed stacks using dotnet-dump:
I find it weird @janvorli do you see something useful in here? Are there some other tests I could run that may show something interesting? |
Have you opened the dump on the same machine where it was generated? |
At the end of the CI build, the testhost and coredumps are collected. I opened those on another arm64 rhel8 machine. |
I ran a few experiments which suggest the issue is in the rhel8 kernel. |
@janvorli do you have some suggestion on what stand-alone applications I may try to run that could trigger this issue? So far I'm running the whole build+tests and see only 1-5 occurrences at random places. It would be nice if I could find something smaller but I don't know what I'm looking for. |
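While searching for a smaller repro, it can help to compare runs by how often the exception shows up in each log. Below is a minimal sketch of such a count. The helper name `count_mentions` and the sample lines are made up for illustration, and the real CI log format may need a more specific match.

```python
def count_mentions(lines, needle="NullReferenceException"):
    """Count the lines that contain `needle` (plain substring match)."""
    return sum(1 for line in lines if needle in line)


# Made-up log lines, for illustration only (not taken from the CI logs).
sample_log = [
    "starting test run",
    "System.NullReferenceException: Object reference not set to an instance of an object.",
    "test passed",
    "Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.",
]

print(count_mentions(sample_log))
```

For a real log, an open file object can be passed in place of the list, since iterating over a file yields its lines.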
It is hard to say what app could repro the problem when we don't know where it is stemming from. However, I would recommend trying to repro it with coreclr pri 1 tests running with GC stress 3 (a checked build of coreclr is needed for that to work). My guess is that it could raise the frequency of the problem considerably and might even get it to repro in 90-100% of cases on certain tests. Then you can pick one of those tests and run it with GC stress enabled under lldb to debug it. |
I'm not familiar with running those tests. These are the commands I'm using:
$ ./build.sh clr+libs -rc checked --librariesConfiguration Release /p:NoPgoOptimize=true
$ ./src/tests/build.sh arm64 checked
I got about 25
@janvorli maybe this tells you something? I tried to invoke this command a couple of times, but the build doesn't seem to be incremental, so it starts over and errors out at some point. There are probably some tests I can already run, but I don't know how to start them.
$ ./src/tests/run.sh arm64 checked --gcstresslevel=3
Running on CPU- arm64
testRootDir and other existing arguments is no longer required. If the
default location is incorrect or does not exist, please use
--testRootDir to explicitly override the defaults.
Build Architecture : arm64
Build Configuration : Checked
python /root/runtime/src/tests/../../src/tests/run.py -arch arm64 -build_type Checked
Error, Core_Root could not be determined, or points to a location that doesn't exist.
Yes, it should. I had to update |
Ok, if the test build fails consistently, then you can build the tests using those two commands you've tried, but on a different Linux arm64 distro, and then copy everything under artifacts/tests/Linux.arm64.Checked from the build machine to the same subfolder of the runtime repo on your RHEL 8 machine. If you have the distro on the same absolute path on both machines, source level debugging should just work after the build. If the build of the tests didn't complete, you most likely cannot run anything. The test build has several phases, and the last phase builds test wrappers for all the tests so that they can be run using xunit. |
I compiled on Fedora, and then ran on RHEL8. Unfortunately, the tests did not run due to the glibc version being older, so I added a Fedora container in the middle to work around it. I forgot to add the 4 tests failed, the errors are below the results table. @janvorli do you see something interesting?
|
Without the priority1 option, you've run just the priority 0 tests, which is just a fraction of all the 10000+ tests. As for the failures, these same tests keep failing on my local Ubuntu 16.04 repo too, so these are not indications of any RHEL 8 specific issue. The -100 exit code means timeout. |
@janvorli I ran pri1 tests. Can you see if there is something interesting in the results below? Based on the summary table, these are the additional failures:
Full summary table:
This is the full log of the test run: pri1.log. |
The failures with asserts |
Yes, it has 64kB pages. |
@janvorli I assume it is very likely this is the root cause for the NullReferenceExceptions? Thank you for your help! |
It could theoretically be the case. To be sure, you can experimentally rebuild the RHEL 8 kernel with page size set to 4kB, run RHEL 8 with it and see if it fixes the issue. |
@janvorli looks like some linkers have a flag for this.
|
@tmds thank you, that sounds great! I'll give it a try. |
@janvorli have you looked at this issue further? |
I have tried to use the --rosegment, but the default linker doesn't support that option. Only ld-gold and lld do. I will try to switch the linker to lld, I wanted to do that a long time ago anyways. |
I installed lld on that ARM64 machine too, so it should be ready whenever you are. |
@janvorli are you making some progress on this issue? |
I am sorry for the delay. I have just tried to enable linking using lld and just that fixes the issue, as lld by default puts rodata into a non-text segment. I will send out a PR soon. |
The null reference exceptions on Apple Silicon were mostly resolved by Apple macOS fixes. They just went away when I updated my machine to macOS 11.3 Beta 6. There could be a few still lingering, but I haven't identified them yet. Our CI machines were just updated to macOS 11.3. I am in the process of reenabling the previously failing tests. I'll have a better idea whether there are any other lingering issues once those are reenabled and we have run for a few days or weeks. Your observation that it is likely kernel related seems believable. |
@sdmaclea maybe @janvorli figures something out when he takes a look. |
I was guessing
Basically same opinion as @janvorli ("It seems it might be related to something with capturing / restoring context around GC suspension, the FlushProcessWriteBuffers not working or something of that kind.")
No. It also looks like it might have been only one of many issues. The Apple Silicon CI macOS upgrade improved the pass rate, but I still see these null reference exceptions in CI (but not on my local machine). I am going through the differences to see if I can get CI to match my local experience. |
@janvorli without thinking much about it I asked our CI to build your branch. The build doesn't work because the SDK that gets downloaded to perform the build still has the rodata in the wrong segment and crashes. Next week, I'll try to build libcoreclr separately and patch the build SDK. I'm puzzled why other arm64 distros don't have an issue with the |
@tmds I believe the issue doesn't occur with 4kB memory pages. Only when the distro has larger pages does the block with the cookie "leak" into code. |
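The page-size dependence can be illustrated with plain address arithmetic. Memory protection applies to whole pages, so two nearby addresses that fall on separate 4kB pages can end up on the same 64kB page. The sketch below only shows that arithmetic. The two addresses and the helper names `page_base` and `same_page` are hypothetical and are not taken from libcoreclr or from this thread.

```python
import os


def page_base(addr: int, page_size: int) -> int:
    """Round addr down to the start of the page that contains it.

    Assumes page_size is a power of two.
    """
    return addr & ~(page_size - 1)


def same_page(a: int, b: int, page_size: int) -> bool:
    """Return True when both addresses fall on the same page."""
    return page_base(a, page_size) == page_base(b, page_size)


# Hypothetical addresses, 8 kB apart, chosen only for illustration.
data_addr = 0x41000
code_addr = 0x43000

for size in (4 * 1024, 64 * 1024):
    shared = same_page(data_addr, code_addr, size)
    print(f"{size // 1024} kB pages: same page = {shared}")

# Page size of the machine running this script (Unix only).
print("page size here:", os.sysconf("SC_PAGE_SIZE"))
```

With these example values the two addresses sit on different pages at 4kB and on one shared page at 64kB, which is the kind of overlap that can only appear on a distro with larger pages.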
Yes. The cookie issue still causes our builds to fail from the start. Our plan is to build .NET 6 for arm64, but this issue needs to be resolved for that. I've looked at the problem but I couldn't figure out the root cause. I think it is in the kernel. |
@janvorli Now that preview6 is wrapping up, any idea on when you'll be able to take another look at this? |
I have created a PR in arcade to fix the rootfs build for Alpine 3.9. After consulting with @mthalman, I am going to get my original change to the docker images in, keep building for Alpine on 3.9 for now, and move to 3.13 after preview 7. Then I can get in my change to use the lld linker and start looking into the null reference issues. We still have null reference issues on Apple Silicon, so chances are they are related. |
@omajid, @tmds I have tried to run all coreclr pri 1 tests on RHEL 8 with 64kB page size using the latest main and no tests were failing with NullReferenceException anymore. |
Thanks, @janvorli ! Any idea when a fix might land such that building runtime works out of the box? Maybe in a month or so? |
I'm not sure you're running tests in a way that shows the The |
I believe the NullReferenceException was fixed by another change, #53510. That was what was causing those on macOS arm64 and it was not Apple specific. |
The fix will be part of RC1, which will come after preview 7. |
Great! Thank you for the reference. |
In our CI builds, each run on RHEL8 arm64 shows NullReferenceExceptions in the log.
On the same arm64 host with a Fedora 32 VM there are no NullReferenceExceptions.
When I build and test on another RHEL8 arm64 machine, NullReferenceExceptions also show up in unexpected places.
Some example stack traces from CI log:
Microsoft.Extensions.Hosting tests
System.Linq.Parallel.Tests
System.Text.Json.Serialization.Tests
@janvorli I don't know how to debug this. Can you take a look, or give me some pointers?
cc @omajid