test-azure-win2012r2-x64-2 / test-azure-win2016-x64-1: openj9 SharedClasses.xxx tests fail (Memory issue?) #1963

Closed
lumpfish opened this issue Feb 22, 2021 · 23 comments

@lumpfish

The following openj9 shared classes test targets may fail when they land on test-azure-win2012r2-x64-2 or test-azure-win2016-x64-1.

SharedClassesAPI
SharedClasses.SCM01.MultiCL
SharedClasses.SCM01.MultiThread
SharedClasses.SCM01.MultiThreadMultiCL
SharedClasses.SCM23.MultiCL
SharedClasses.SCM23.MultiThread
SharedClasses.SCM23.MultiThreadMultiCL

The symptoms are various out-of-memory errors, e.g.

11:52:21  MT4 stderr JVMDUMP032I JVM requested Snap dump using 'C:\Users\jenkins\workspace\Grinder\openjdk-tests\TKG\output_1613993216963\SharedClasses.SCM23.MultiThread_1\20210222-114232-SharedClasses\results\Snap.20210222.114552.7872.0004.trc' in response to an event
11:52:21  MT4 stderr JVMDUMP010I Snap dump written to C:\Users\jenkins\workspace\Grinder\openjdk-tests\TKG\output_1613993216963\SharedClasses.SCM23.MultiThread_1\20210222-114232-SharedClasses\results\Snap.20210222.114552.7872.0004.trc
11:52:21  MT4 stderr JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
11:52:21  MT4 stderr Exception in thread "main" java.lang.OutOfMemoryError: Failed to create a thread: retVal -1073741830, errno 22
11:52:21  MT4 stderr 	at java.lang.Thread.startImpl(Native Method)
11:52:21  MT4 stderr 	at java.lang.Thread.start(Thread.java:993)
11:52:21  MT4 stderr 	at net.openj9.test.sc.LoaderSlaveMultiThread.run(LoaderSlaveMultiThread.java:130)
11:52:21  MT4 stderr 	at net.openj9.test.sc.LoaderSlaveMultiThread.main(LoaderSlaveMultiThread.java:59)

Their Jenkins links show the machines have 4GB of RAM:
https://ci.adoptopenjdk.net/computer/test-azure-win2012r2-x64-2/ - Failed
https://ci.adoptopenjdk.net/computer/test-azure-win2016-x64-1/ - Failed

The links for two other machines also show them as having 4GB of memory, but the tests pass on those machines:
https://ci.adoptopenjdk.net/computer/test-azure-win2012r2-x64-1/ - Passed
https://ci.adoptopenjdk.net/computer/test-azure-win2012r2-x64-3/ - Passed
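
For anyone reading the stack trace above: the retVal in "Failed to create a thread: retVal -1073741830, errno 22" is easier to read in hex. The sketch below is only arithmetic. How OpenJ9 assigns meaning to the bits is not stated in this thread, so the split into a flag bit (bit 30) and a low code is an assumption, and `split_retval` is a hypothetical helper name. Likewise, 22 corresponds to EINVAL only if the reported errno is the C runtime's errno, which is also an assumption.

```python
import errno


def split_retval(ret_val, flag_bit=0x40000000):
    """Split a return value into (flag_set, low_code).

    Assumption (not confirmed in this thread): the value is the negation
    of (flag_bit | low_code), with bit 30 acting as a flag.
    """
    magnitude = abs(ret_val)
    flag_set = bool(magnitude & flag_bit)
    low_code = magnitude & ~flag_bit
    return flag_set, low_code


print(hex(1073741830))               # 0x40000006
print(split_retval(-1073741830))     # (True, 6)
print(errno.errorcode.get(22))       # EINVAL (C runtime errno name)
```

So the magnitude is 0x40000006: bit 30 set plus a low value of 6.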

@sxa
Member

sxa commented Feb 22, 2021

Seems likely related to the memory on those machines. Next steps should probably be to verify the swap file settings, whether they can be increased with any effect, and if not we should look to increase the RAM on those systems to 6GB first, then 8GB if that doesn't work.

@karianna
Contributor

> Seems likely related to the memory on those machines. Next steps should probably be to verify the swap file settings, whether they can be increased with any effect, and if not we should look to increase the RAM on those systems to 6GB first, then 8GB if that doesn't work.

Could also be filehandles.

@karianna karianna added this to the February 2021 milestone Feb 23, 2021
@karianna karianna removed this from the February 2021 milestone Feb 23, 2021
@sxa
Member

sxa commented Feb 23, 2021

> Could also be filehandles.

What determines available file handles on a per-machine basis? Is that in any way a default based on RAM size or something else?

@sxa
Member

sxa commented Feb 23, 2021

(I've disabled the win2016 system by removing ci.role.test until this can be debugged/diagnosed)

@karianna
Contributor

> > Could also be filehandles.
>
> What determines available file handles on a per-machine basis? Is that in any way a default set on RAM size or something else?

On Windows? I've actually got no idea.

@sxa
Member

sxa commented Feb 24, 2021

Testing here with swap space increased on test-azure-win2016-x64-1 (assuming it goes live without a reboot). If that doesn't work I'll increase the RAM to 6GB.

@sxa
Member

sxa commented Feb 24, 2021

Hmmm 2012r2-2 has 16GB of RAM. Running a Grinder on there too to verify

@sxa
Member

sxa commented Feb 24, 2021

So the Grinder on the win2016 box failed but not with an obvious memory issue - @lumpfish can you check the log of that one to see if it's the same issue you've seen?

The win2012r2 did give an OutOfMemoryError - I have made sure there is up to 12GB of swap and am re-running in this grinder

@sxa sxa changed the title from "test-azure-win2012r2-x64-2 / test-azure-win2016-x64-1: openj9 SharedClasses.xxx tests fail" to "test-azure-win2012r2-x64-2 / test-azure-win2016-x64-1: openj9 SharedClasses.xxx tests fail (Memory issue?)" Feb 24, 2021
@sxa sxa self-assigned this Feb 24, 2021
@sxa
Member

sxa commented Feb 25, 2021

The Win2012 machine showed an OutOfMemory error during one of the tests (a different one in each run) in 7231 and 7237. I'm going to restart it, run the same test again while trying to watch the usage live on the machine, and then see how easy it is to increase to 6GB. [EDIT: no I won't, as Azure doesn't have 6GB options, so it'll have to be 8GB, which is almost twice the cost unfortunately ... Maybe I'll just shut down the 2012 one and bump the 2016 up to the 8GB B2ms spec.]

@lumpfish
Author

lumpfish commented Feb 25, 2021

> So the Grinder on the win2016 box failed but not with an obvious memory issue - @lumpfish can you check the log of that one to see if it's the same issue you've seen?

That test is similar in that it runs multiple JVMs in parallel which share a shared class cache.

The stderr from the failing process (found by downloading the system_test_output.tar.gz file from the failing job, https://ci.adoptopenjdk.net/job/Grinder/7230/) contains:

JVMSHRC162E The wait for the creation mutex while opening shared memory has timed out
JVMSHRC662I Error recovery: destroyed semaphore set associated with shared class cache.
JVMSHRC840E Failed to start up the shared cache.
JVMJ9VM015W Initialization error for library j9shr29(11): JVMJ9VM009E J9VMDllMain failed
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

I've not seen (or noticed) that before.

@sxa
Member

sxa commented Feb 25, 2021

Hmmm https://ci.adoptopenjdk.net/job/Grinder/7260/ ran through without any failure on azure-win2012r2-2 after an earlier reboot.

Although on trying again, this has popped up:
image
Upgrade time then! (FYI @smlambert looks like Windows tests can't complete on a 4GB Windows system)

@sxa
Member

sxa commented Feb 25, 2021

I've shut the Windows2012 machine down (it's also more expensive than the new ones I've set up, so shutting it down isn't a bad idea). I'm re-running a Grinder on the 2016 machine (7268) since the previous one passed, and I'll look at bumping it up to 8GB if it fails (it will still be cheaper than the Win2012 one). [EDIT: 7268 passed - running again on the 4GB Win2016 box at 7277 and 7278]

Side note: I'm also running a Grinder on one of the larger 2012 boxes at 7269 - mostly because I'm curious as to whether there are any performance differences on that one (but I suspect on the system test suites it won't make much difference).

@sxa
Member

sxa commented Feb 25, 2021

7277 failed a test but did not throw a visible OutOfMemory error, so inconclusive

@lumpfish
Author

7277 failed with the same mutex wait error:

JVMSHRC162E The wait for the creation mutex while opening shared memory has timed out
JVMSHRC662I Error recovery: destroyed semaphore set associated with shared class cache.
JVMSHRC840E Failed to start up the shared cache.
JVMJ9VM015W Initialization error for library j9shr29(11): JVMJ9VM009E J9VMDllMain failed
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
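
Since the Grinder runs in this thread fail in at least two distinct ways (thread-creation OutOfMemoryError versus the shared cache mutex timeout), a small classifier over the stderr text can make it quicker to tell them apart. This is a minimal sketch, not part of any existing tooling: the function name, labels, and signature table are hypothetical, and the patterns only match the message strings quoted in this issue.

```python
import re

# Hypothetical signature table built from the messages quoted in this issue.
SIGNATURES = [
    ("out_of_memory", re.compile(r"java\.lang\.OutOfMemoryError")),
    ("shared_cache_mutex_timeout", re.compile(r"\bJVMSHRC162E\b")),
    ("shared_cache_startup_failed", re.compile(r"\bJVMSHRC840E\b")),
]


def classify_stderr(text):
    """Return the label of every signature found in text, in table order."""
    return [label for label, pattern in SIGNATURES if pattern.search(text)]


oom_log = (
    'Exception in thread "main" java.lang.OutOfMemoryError: '
    "Failed to create a thread: retVal -1073741830, errno 22"
)
cache_log = (
    "JVMSHRC162E The wait for the creation mutex while opening shared memory has timed out\n"
    "JVMSHRC840E Failed to start up the shared cache.\n"
)

print(classify_stderr(oom_log))    # ['out_of_memory']
print(classify_stderr(cache_log))  # ['shared_cache_mutex_timeout', 'shared_cache_startup_failed']
```

An empty list means none of the known signatures appeared, which is the "inconclusive" case and still needs a manual look at the log.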

@sxa
Member

sxa commented Feb 25, 2021

Despite the above tests being inconclusive due to the failure on shared class setup, I'm going to go ahead with the upgrade.

Converted test-azure-win2016-x64-1 from B2s (left) to B2ms (right). Back online with ci.role.test label and queued up two Grinders 7288 and 7299 - hopefully that will resolve the OutOfMemoryErrors if not the class cache issue.

image
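
The sizing decision above (6GB wanted, no 6GB option, so 8GB) amounts to "pick the smallest available size that meets the requirement". A minimal sketch, assuming only the two sizes named in this thread and taking their memory figures from the discussion (B2s as the 4GB machine, B2ms as the 8GB spec); the table and function name are hypothetical, and no pricing is modelled.

```python
# Memory per size in GB. Only the sizes named in this thread are listed;
# the figures are taken from the discussion above, not from an Azure API.
SIZES_GB = {"B2s": 4, "B2ms": 8}


def smallest_size_with(min_gb, sizes=SIZES_GB):
    """Return the name of the smallest size offering at least min_gb of RAM."""
    candidates = [(gb, name) for name, gb in sizes.items() if gb >= min_gb]
    if not candidates:
        raise ValueError(f"no listed size offers {min_gb}GB or more")
    return min(candidates)[1]


print(smallest_size_with(6))  # B2ms
print(smallest_size_with(4))  # B2s
```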

@sxa
Member

sxa commented Mar 1, 2021

I'm going to deprovision https://ci.adoptopenjdk.net/computer/test-azure-win2012r2-x64-2/ (test-2012r2-2 on the Azure portal) - we can recreate it if required in the future, but it's unfit for purpose in its current state and cannot easily be converted to a cost-effective larger system.

@sxa
Member

sxa commented Mar 1, 2021

7288 failed but https://ci.adoptopenjdk.net/job/Grinder/7301/ succeeded - @lumpfish can you take a look at 7288 and let me know if you're concerned about the failure (in terms of whether it could still be a machine-specific one-off)?

@Haroon-Khel Haroon-Khel modified the milestones: March 2021, April 2021 Apr 6, 2021
@github-actions github-actions bot added the arch:x64 and openj9 (Stuff specific to OpenJ9) labels Apr 6, 2021
@Haroon-Khel Haroon-Khel modified the milestones: April 2021, May 2021 May 18, 2021
@Haroon-Khel Haroon-Khel modified the milestones: May 2021, June 2021 Jun 21, 2021
@sxa
Member

sxa commented Jun 30, 2021

Updated links to re-run:

@Haroon-Khel Haroon-Khel modified the milestones: June 2021, August Aug 2, 2021
@sxa sxa modified the milestones: October 2021, December 2021 Dec 1, 2021
@sxa sxa modified the milestones: December 2021, 2022-01 (January) Jan 6, 2022
@sxa sxa modified the milestones: 2022-03 (March), Backlog May 24, 2022
@sxa
Member

sxa commented Feb 6, 2023

Re-runs:

@sophia-guo

We don't run impl=openj9 tests in Adoptium, so can win2016 be enabled?

@sxa
Member

sxa commented Nov 5, 2024

Closing as this is OpenJ9-specific and was failing on two machines that have been decommissioned.

@sxa sxa closed this as completed Nov 5, 2024