-
Notifications
You must be signed in to change notification settings - Fork 728
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JDK20 jdk/incubator/concurrent/ScopedValue/StressStackOverflow.java StressStackOverflow$TestFailureException: Unexpected value for ScopedValue #16965
Comments
|
I modified |
@thallium Open a draft PR with your implementation so it can be reviewed. Compare your implementation with the pseudo-code in #16677 (comment). Can you provide more details for the failure at |
It's not a fix, just a hack to check if it's related to
Yes. @babsingh |
@thallium Do you have free cycles to implement the pseudo-code in #16677 (comment)? If yes, refer to the below code to implement the pseudo-code. openj9/runtime/jcl/common/jdk_internal_misc_ScopedMemoryAccess.cpp Lines 93 to 106 in ab8319b
|
@babsingh Yes I'm working on this |
Ran a 20x grinder with
|
Ran the test 20x locally with |
re #16965 (comment): Thanks, @thallium.
|
@babsingh Not too much I can find. The test tests whether the scoped value can be recovered correctly after a stack overflow. In each recursive call the scoped value |
JVM option to create core files during Java stack when the exception was thrown; the top frames are JIT'ed + inlined:
@thallium To investigate
Based on the code in ScopedValue.scopedValueBindings, this issue might still be related to |
Got seg fault when trying to print the value of the scoped value:
Stack trace from gdb:
|
fyi @0xdaryl A segfault in the JIT is noticed while debugging this failure. Please see #16965 (comment) for more details. |
@nbhuiyan : inliner crash above. Please investigate for 20.0.2. |
It is probably worth opening a separate JIT issue for the seg fault to divorce it from the original issue being investigated, unless anyone feels otherwise. |
Original issue: #16965 (comment) The original issue goes away with |
|
Removed |
@babsingh this issue might require some insights from the VM side. The reason why we see the problem during inlining is because of an invalid entry in the invokecache array for an invokedynamic bytecode. This problem is seen when trying to obtain the address of the appendix object, which returns |
With AOT disabled, and all optimizations beyond inlining disabled, opt level restricted to warm, and only this method below getting compiled, the failure is reproducible:
With only just changing a line in the test case to print the scoped value immediately prior to throwing the As a result, we can now look at fairly simple full compilation logs for passing and and a failing cases that result in identical compiled code (other than the relative addresses, symref numbers, etc):
Using a diff tool would reveal that the generated code for both passing and failing cases were identical, which suggests that the problem may lie outside of what the JIT compilations were responsible for. As for the differences in the number of calls to the
|
@fengxue-IS as requested, here are the steps to reproduce what I did:
The limitfile option requires that the file is in the current directory, or you would need to specify an absolute path. The parameters |
Now, the test has four sub-variants, which need to be individually excluded. Related: eclipse-openj9/openj9#18065 Related: eclipse-openj9/openj9#16965 Signed-off-by: Babneet Singh <[email protected]>
Now, the test has four sub-variants, which need to be individually excluded. Related: eclipse-openj9/openj9#18065 Related: eclipse-openj9/openj9#16965 Signed-off-by: Babneet Singh <[email protected]>
I did some testing with modified java code to init a new Exception in place rather than pre-init
based on the exception location, it seems to suggest the issue is triggered only in finally blocks, looking into the corefile when the exception is generated, the value of This suggests that the way @nbhuiyan can you verify the code generated for the finally block please |
The generated code for the finally block (which is actually split into two paths based on whether an exception was thrown or not) looks correct to me. However, with further testing by modifying the code, I found that the finally block may not always execute, which is not the correct behaviour. I modified private void runWith(Snapshot newSnapshot, Runnable op) {
boolean exceptionThrown = false;
try {
Thread.setScopedValueBindings(newSnapshot);
Thread.ensureMaterializedForStackWalk(newSnapshot);
ScopedValueContainer.run(op);
} catch (Exception e) {
System.out.println("Caught exception in runWith...");
exceptionThrown = true;
throw e;
} finally {
Reference.reachabilityFence(newSnapshot);
Thread.setScopedValueBindings(newSnapshot.prev);
Cache.invalidate(bitmask);
if (exceptionThrown) System.out.println("Exception was caught and rethrown, and finally stuff was run.");
}
} The above code should result in printing both the print statements in the catch block and the finally block each time an exception was thrown, but in my testing I found cases of the message in the catch block being printed without the message in the finally block, suggesting that somehow we skip running the the finally block code. Here is an example of an output that suggests the finally block may be skipped:
I am looking into what's causing this behaviour. |
A couple of observations on this. This test fails reliably at noopt and hasn't been reproduced with -Xint. Noopt failures are usually indicative of either a codegen problem, a timing problem somewhere (e.g., somewhere in the VM or a data race with the test case), or some other problem where a token JIT method is required somewhere on the stack. This test fails readily on both x86 and AArch64 which makes it much less likely to be a codegen problem unless the IL is messed up (which I don't think it is). I am always wary of OpenJDK tests that expect certain behaviour in the presence of forced stack overflows (mainly because they're tuned for OpenJDK behaviour and not OpenJ9). That doesn't mean the test is incorrect, but it does raise some suspicion for me. The behaviour of the test itself is driven in part by random choices within. I am presently removing those to produce a test with the most straight-line path to failure to focus the investigation. |
@0xdaryl Any new news on pruning the test? |
Now, the test has four sub-variants, which need to be individually excluded. Related: eclipse-openj9/openj9#18065 Related: eclipse-openj9/openj9#16965 Signed-off-by: Babneet Singh <[email protected]>
Now, the test has four sub-variants, which need to be individually excluded. Related: eclipse-openj9/openj9#18065 Related: eclipse-openj9/openj9#16965 Signed-off-by: Babneet Singh <[email protected]>
Now, the test has four sub-variants, which need to be individually excluded. Related: eclipse-openj9/openj9#18065 Related: eclipse-openj9/openj9#16965 Signed-off-by: Babneet Singh <[email protected]>
Simplifying the test has helped focus the investigation on the ScopedValue/ScopedValueContainer implementations (as opposed to the test itself) which is where we're looking now. Analysis is proceeding at no-opt. The test fails readily at default no-opt but failures quickly go into remission when attempting to get more diagnostic info out or limiting it to a small subset of methods. Since it fails at no-opt, no workaround is available (yet) other than disabling the JIT. I'm taking a much deeper look at the |
I'm going through the ScopedValue and ScopedValueContainer code to understand what it is doing under the covers. Fortunately for this failure, only one thread is involved which simplifies the analysis. This failure has the feel of a data race given how it fails and how the problem can be made to go into remission with subtle changes to the test, but that seems unlikely with a single thread. |
We now have a core file with a verbose stackwalk trace at the point the TestFailureException is thrown with all methods compiled at noopt that we are sorting through. |
No major insights yet from the analysis. In parallel, we attempted to collect logs and a core file from the original, unmodified test just to be sure we did not deviate too far from the original problem, but we have been unsuccessful in obtaining any failure artifacts on the original test. We'll continue to work with the reduced test. |
This particular test relies on artificially induced stack overflows to exercise unwinding the stacked scopes and verifying this is done correctly. Stack overflows occur frequently in this test (it is a stress test after all!), which exercises the unwinding operation a lot. However, stack overflows can be difficult to control and may not occur where the test expects them to (or where OpenJDK VM would perform them). Their non-deterministic behaviour is fueled by the presence of JIT compiled methods that are compiled asynchronously and may have different frame shapes between compilations. If a stack overflow were to occur during a critical update operation that modifies the stack scopes then this could lead to inconsistent behaviour. This isn't just a theory. For example, by instrumenting the class library, I have found instances where the ScopedValueContainer "push()" operation is interrupted by a stack overflow, thereby not completing its operation. While in that particular case I believe (through inspection) that the state should be consistent, it does raise the possibility that a stack overflow (which we know there are a great many) is occurring somewhere unexpected during a critical operation and leaving the representation of the stacked ScopedValues in an inconsistent state. Confirming this has been challenging because the problem tends to go into remission with even minor changes the the test or JCL. Up to this point, for this failure I have found no evidence that the VM or JIT is misbehaving. Because this test is heavily exercising the ScopedValue feature in an unnaturally stressful way, and the fact that ScopedValues are a preview feature in JDK21, and the test itself is somewhat suspect, I don't think this should block JDK 21 release. |
Moved it forward. |
I should point out that only a single thread is needed to reproduce this behaviour. Virtual threads are not required either. |
Moving this to the "Future" milestone. It's something that needs to be looked at, but I suspect no one will be able to get to the bottom of it for an upcoming release. |
Did we discover why the finally block doesn't always run (as described in #16965 (comment)) ? |
@JamesKingdon I did not continue the investigation down that path to understand why we may be skipping the finally block, or whether this behaviour can be reproduced in any other test where the test is not artificially inducing lots of stack overflows. |
Failure link
From an internal build(
ub18-aarch64-8
):Rerun in Grinder - Change TARGET to run only the failed test targets.
Optional info
Failure output (captured from console output)
Also appeared in other platforms such as x86-64_linux s390x_linux
The text was updated successfully, but these errors were encountered: