JDK21 sanity.openjdk failures on x86-64_linux on some machines #5012

Closed
smlambert opened this issue Jan 29, 2024 · 14 comments

@smlambert
Contributor

smlambert commented Jan 29, 2024

Some test cases from the jdk21 sanity.openjdk targets are failing on x86-64_linux on certain machines. The failures are in the jdk_lang, jdk_util and jdk_foreign test targets; details can be found in the Jan CPU JDK21 AQA triage, see #4983 (comment).


sanity.openjdk - 5 targets fail: jdk_lang_0, jdk_lang_1, jdk_util_0, jdk_util_1 and jdk_foreign_0. The rerun in Grinder/8539 fails; the rerun on test-docker-ubuntu2204-x64-1 in Grinder/8557 passes.
Grinder_20240118175328_jdk21_x64Linux.tap.txt


Marking https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux/151/ as a keep forever to show a good passing run on test-docker-fedora35-x64-1

Failing in https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux/153/ with many test cases timing out on test-docker-fedora37-x64-3, and many occurrences of "03:32:59 ACTION: testng -- Failed. Unexpected exit from test [exit code: 137]" and "Error. Agent communication error: java.net.SocketException: Broken pipe; check console log for any additional details".

Failing in https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux/155/ on test-docker-fedora35-x64-1 with the same types of issues as in Test_openjdk21_hs_sanity.openjdk_x86-64_linux/153 above.

List of failing test cases from Test_openjdk21_hs_sanity.openjdk_x86-64_linux/155:

		java/lang/Package/bootclasspath/GetPackageFromBootClassPath.java.GetPackageFromBootClassPath
		java/lang/StackWalker/LocalsAndOperands.java#id1.LocalsAndOperands_id1
		java/lang/StrictMath/SqrtTests.java.SqrtTests
		java/lang/String/concat/ImplicitStringConcatManyLongs.java.ImplicitStringConcatManyLongs
		java/lang/Thread/virtual/stress/Skynet.java#default.Skynet_default
		jdk/internal/vm/Continuation/BasicExt.java#COMP_WINDOW_LENGTH_2.BasicExt_COMP_WINDOW_LENGTH_2
		java/util/Arrays/TimSortStackSize2.java.TimSortStackSize2
		java/util/LinkedHashMap/Basic.java.Basic
		java/util/LinkedHashSet/Basic.java.Basic
		java/util/List/NestedSubList.java.NestedSubList
		java/util/Locale/Bug4152725.java.Bug4152725
		java/util/Locale/Bug6989440.java.Bug6989440
		java/util/Locale/bcp47u/CurrencyTests.java.CurrencyTests
		java/util/StringJoiner/StringJoinerTest.java.StringJoinerTest
		java/util/concurrent/tck/JSR166TestCase.java#others.JSR166TestCase_others
		java/util/jar/JarFile/jarVerification/MultiProviderTest.java.MultiProviderTest
		java/util/jar/JarFile/mrjar/MultiReleaseJarHttpProperties.java.MultiReleaseJarHttpProperties
		java/util/jar/JarFile/mrjar/MultiReleaseJarProperties.java.MultiReleaseJarProperties
		java/util/jar/Manifest/ValueUtf8Coding.java.ValueUtf8Coding
		java/lang/StackWalker/LocalsAndOperands.java#id0.LocalsAndOperands_id0
		java/lang/String/concat/ImplicitStringConcatShapes.java.ImplicitStringConcatShapes
		java/util/StringJoiner/MergeTest.java.MergeTest
		java/util/concurrent/forkjoin/AsyncShutdownNowInvokeAnyRace.java.AsyncShutdownNowInvokeAnyRace
		java/util/concurrent/forkjoin/Integrate.java.Integrate
		java/util/stream/test/org/openjdk/tests/java/util/stream/mapMultiOpTest.java.mapMultiOpTest
		java/util/zip/ZipFile/TestZipFileEncodings.java.TestZipFileEncodings
		java/foreign/TestLargeSegmentCopy.java.TestLargeSegmentCopy
		java/lang/String/concat/ImplicitStringConcatOOME.java.ImplicitStringConcatOOME
		java/util/BitSet/stream/BitSetStreamTest.java.BitSetStreamTest
		java/util/HashMap/WhiteBoxResizeTest.java.WhiteBoxResizeTest
		java/util/HexFormat/HexFormatTest.java.HexFormatTest

@smlambert
Contributor Author

Deep history view looks like:

Screenshot 2024-01-29 at 2 08 28 PM

indicating that something changed between Dec 16 and 23 (either in the test material or in the machine configuration) to make this issue arise.

@adamfarley
Contributor

adamfarley commented Jan 30, 2024

Ok, the following unit tests seem to fail uniquely on the new machine (test-docker-fedora39-x64-1), but passed when run on other fedora machines in the past (example1, example2).

sun/security/krb5/MicroTime
java/util/Locale/bug4122700
java/util/Map/InPlaceOpsCollisions
java/util/ResourceBundle/Bug6355009
java/util/Scanner/ScanTest

The Grinder re-run Stewart launched tells us that all of those were infrequent failures, as they passed when rerun on the same machine.

In short, I see no consistent failures in sanity.openjdk that have not occurred on existing Fedora machines in the past month, so there's nothing uniquely wrong with this new Fedora machine.

@smlambert
Contributor Author

@adamfarley - please continue to investigate this issue to understand what has changed between Dec 16 and 23 that introduces these failures.

  • are the failing tests new or changed?
  • has there been a change to the machine configuration or the underlying docker host machine?
    and so forth

@smlambert smlambert moved this from Todo to In Progress in 2024 1Q Adoptium Plan Jan 30, 2024
@adamfarley
Contributor

Will do.

@adamfarley
Contributor

adamfarley commented Feb 1, 2024

Summary

I think this could be a memory issue caused by a massive concurrency spike.

Details

Ok, I've taken a look at the failing test targets, and I'm seeing a 2-3x increase in runtime on Fedora, which is also where the test targets fail most consistently. Ubuntu and CentOS seem to pass the test targets at least some of the time, so I'm concentrating on Fedora for the first pass.

Exit code 137 is 128 + 9, meaning the process was killed with SIGKILL, which on Linux is typically what happens when the out-of-memory killer steps in. This may also explain the socket error if, hypothetically, the VM we're trying to "get" has also been killed.
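
As a rough illustration of how exit code 137 maps to a signal, here is a minimal Python sketch. It assumes the usual shell convention that a status above 128 means "terminated by signal (status - 128)". The helper name describe_exit_code is hypothetical and not part of any test framework mentioned in this thread.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit status into a short description.

    Assumption: statuses above 128 follow the common shell convention
    of 128 + signal number.
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

On its own this only shows that the test JVM was killed with SIGKILL. Confirming that the OOM killer sent it would need the kernel log of the docker host.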

Still, that's a hypothesis, not proof, so let's see what changed during the Dec 16 and 23 period that could affect memory usage and/or networking.

The first thing I noticed was that the failures have a concurrency rating of 25 and the passes have a concurrency rating of 3. This would explain the timeouts "getting" a VM, and also the memory problems.

Will kick off some tests with a reduced concurrency. If that solves the issue, let's tweak the code to make sure this doesn't happen again.

EDIT: Ok, this isn't working. Will rerun with the fix proposed below.

@adamfarley
Contributor

@adamfarley
Contributor

adamfarley commented Feb 5, 2024

Ok, I'm confident that the "Potential candidate" is the cause of our headache here.

The code there was originally designed to ensure that the concurrency could not exceed the memory limits of the machine in question.

However, that code change made us use "megabytes" of memory instead of "gigabytes", so as long as the number of processors is less than the number of megabytes of memory, we use (cores/2)+1 concurrency. Since this machine has 6.5 GB of memory and 48 cores, this results in us setting concurrency to 25 (aka (48/2)+1).

So where CORE is now 25 and MEM is the number of megabytes of memory the system has:

ifeq ($(shell expr $(CORE) \> $(MEM)), 1)
	CONC := $(MEM)

So we go from guaranteeing each core 2 GB of memory to guaranteeing 0.26 GB per core (6.5 GB / 25).
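
The arithmetic above can be sketched in Python. This is only an illustration of the cap logic as described in this comment, not the actual makefile code. The function pick_concurrency is hypothetical, and the "one slot per 2 GB" form of the intended cap is an assumption inferred from the "2 GB per core" figure.

```python
def pick_concurrency(cores: int, mem_cap: int) -> int:
    """Start from (cores / 2) + 1 and clamp it to the memory cap.

    Mirrors the quoted check: if CORE > MEM then CONC := MEM.
    """
    conc = cores // 2 + 1
    if conc > mem_cap:
        conc = mem_cap
    return conc


TOTAL_MB = 6656  # 6.5 GB, the machine described above
CORES = 48

# Assumed intent: the cap counts 2 GB slots, so 6.5 GB allows 3 jobs.
intended = pick_concurrency(CORES, TOTAL_MB // 2048)

# After the change: the cap is the raw megabyte count, so it never bites.
regressed = pick_concurrency(CORES, TOTAL_MB)

per_job_gb = TOTAL_MB / 1024 / regressed

print(intended, regressed, round(per_job_gb, 2))
```

Under these assumptions the intended cap gives a concurrency of 3, which matches the value seen on the passing runs, while the megabyte-based cap gives 25.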

I think this code change needs to be adjusted to avoid starving threads of memory.

Will put together a fix and run it past sxa.

Test run: https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux/162/

@adamfarley
Contributor

adamfarley commented Feb 5, 2024

Ok, there appears to be a bug preventing test pipeline jobs from running anything from personal forks, so here's a grinder:

https://ci.adoptium.net/job/Grinder/8715/

@adamfarley
Contributor

The grinder successfully reduces the number of concurrent threads in the test. Now to see if reducing the concurrency reduces the number of failures. Leaving to run.

@sophia-guo
Contributor

Another note: the deep history shows it mainly failed on the docker agents.
Screenshot 2024-02-05 at 12 01 23 PM

The Jan 30th, 2024 run on test-ibmcloud-ubuntu1604-x64-1 succeeds. PR #5035 might also be part of the fix: it reduces NPROCS, so if $(CORE) < $(MEM) we won't hit the issue. The MEM calculation might only be a problem if $(CORE) > $(MEM).

@sxa
Member

sxa commented Feb 12, 2024

Potential candidate for concurrency spike trigger.

Seems likely. I /think/ the correct code should be:

else CGMEM=`expr $${KMEMMB} \* 1048576`; fi; CGMEMMB=`expr $${CGMEM} / 1048576`;

So that original PR wasn't correct for all situations.
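
The unit handling in that line can be sketched in Python. This assumes CGMEM is a memory limit in bytes and KMEMMB is the kernel-reported memory in megabytes. The condition that selects the fallback branch is not shown in this thread, so a None check stands in for it here. The function effective_mem_mb is hypothetical.

```python
BYTES_PER_MB = 1048576


def effective_mem_mb(cgroup_limit_bytes, kernel_mem_mb):
    """Return the usable memory in MB.

    Assumption: when no container limit is available (None here, as a
    stand-in for the real check), fall back to the kernel-reported
    figure, converting MB to bytes so both branches agree on units.
    """
    if cgroup_limit_bytes is None:
        cgmem = kernel_mem_mb * BYTES_PER_MB
    else:
        cgmem = cgroup_limit_bytes
    # Convert back to megabytes, as CGMEMMB does in the proposed line.
    return cgmem // BYTES_PER_MB


print(effective_mem_mb(None, 6656))
print(effective_mem_mb(4096 * BYTES_PER_MB, 6656))
```

The point of the sketch is only that both branches hold CGMEM in the same unit before the final division.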

@adamfarley
Contributor

adamfarley commented Feb 14, 2024

Ok, here's the PR: #5063

Update 2024/02/16 - Merged

@adamfarley
Contributor

Ok, the concurrency level should now be within the bounds of sanity. Will kick some jobs off.

@smlambert
Contributor Author

Closed via #5063
