Remove aarch64Alpine from default jdk11u pipeline config due to test hangs reliability #349

Merged
merged 1 commit into from
Jul 1, 2022

Conversation

andrew-m-leonard
Contributor

@andrew-m-leonard andrew-m-leonard commented Jun 29, 2022

Due to nearly always hanging during jdk11u test jobs, the aarch64Alpine platform is being removed from the default nightly jdk11u pipeline config.
See: adoptium/aqa-tests#3799
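
As a rough illustration of the kind of change described above, the sketch below models a per-version map of default nightly platforms and drops one platform for a single version only. The map layout, the `remove_platform` helper, and the platform lists other than aarch64Alpine are hypothetical; this is not the real pipeline config from this repository.

```python
from copy import deepcopy

# Hypothetical default nightly platforms per version (illustrative only,
# not the real pipeline configuration).
DEFAULT_NIGHTLY = {
    "jdk11u": ["x64Linux", "aarch64Linux", "aarch64Alpine"],
    "jdk17u": ["x64Linux", "aarch64Linux", "aarch64Alpine"],
}


def remove_platform(config, version, platform):
    """Return a copy of config with platform removed from one version only."""
    updated = deepcopy(config)
    if version in updated:
        updated[version] = [p for p in updated[version] if p != platform]
    return updated


trimmed = remove_platform(DEFAULT_NIGHTLY, "jdk11u", "aarch64Alpine")
print(trimmed["jdk11u"])
print(trimmed["jdk17u"])
```

Working on a copy keeps the original map intact, so re-enabling the platform later is just a matter of dropping the override.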

Signed-off-by: Andrew Leonard [email protected]

@karianna
Contributor

Do we understand why? Was this a platform we intended to release in July PSU?

@github-actions

Thank you for creating a pull request!

Please check out the information below if you have not made a pull request here before (or if you need a reminder how things work).

Code Quality and Contributing Guidelines

If you have not done so already, please familiarise yourself with our Contributing Guidelines and Code Of Conduct, even if you have contributed before.

Tests

Github actions will run a set of jobs against your PR that will lint and unit test your changes. Keep an eye out for the results from these on the latest commit you submitted. For more information, please see our testing documentation.

In order to run the advanced pipeline tests (executing a set of mock pipelines), I require an admin to post run tests on this PR.
If you are not an admin, please ask for one's attention in #infrastructure on Slack or ping one here.

@andrew-m-leonard
Contributor Author

Do we understand why? Was this a platform we intended to release in July PSU?

@karianna aarch64Alpine is not for July
We don't currently understand why. It has been consistently hanging during tests in different locations. @Haroon-Khel logged on to check the hanging processes, but they just showed the processes in a "Wait" state.

@sxa
Member

sxa commented Jun 30, 2022

@andrew-m-leonard I thought this was only affecting JDK11? Have the problems been seen on other versions?

@andrew-m-leonard
Contributor Author

@sxa yes, seen on jdk17u as well.

I'm just looking at last night's tests, which are currently running, and so far they are all still running, so I'm wondering whether something may have fixed it.
Interestingly, all our arm nodes have also been out, and the comment from Ed at Equinix may be relevant too: https://adoptium.slack.com/archives/C53GHCXL4/p1656526042724669

@sxa
Member

sxa commented Jun 30, 2022

jdk17u

Got a link? https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk17u/job/jdk17u-alpine-linux-aarch64-temurin/ doesn't look like it's experienced any hangs on an initial look

@andrew-m-leonard
Contributor Author

jdk17u

Got a link? https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk17u/job/jdk17u-alpine-linux-aarch64-temurin/ doesn't look like it's experienced any hangs on an initial look

adoptium/aqa-tests#3806

@andrew-m-leonard
Contributor Author

run tests

@sxa
Member

sxa commented Jun 30, 2022

jdk17u

Got a link? https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk17u/job/jdk17u-alpine-linux-aarch64-temurin/ doesn't look like it's experienced any hangs on an initial look

adoptium/aqa-tests#3806

I'm not sure what I'm reading in there. That is in relation to pipeline 34, which appeared to fail somewhere in the GPG signing step (that job has now been deleted so I can't look into it).

There was a comment which links to your issue adoptium/aqa-tests#3799

Are you sure that any delay in that pipeline wasn't just a hold up caused by the executors being held up by JDK11 jobs?

All I can see from pipeline 34's subjobs (NOTE: it's a weekly pipeline, so it would not necessarily be directly comparable to the others) was https://ci.adoptopenjdk.net/job/Test_openjdk17_hs_extended.openjdk_aarch64_alpine-linux_testList_1/ from which you killed #3, even though it looks like it hadn't got to the end (note that the previous runs of that job took longer than the 4h4 it was at when it was terminated).

@andrew-m-leonard
Contributor Author

You may be right that it's only a jdk11u problem, although a jdk19 smoke test has hung this morning.
I thought @sophia-guo had seen the above jdk17u hang during triage?

@sxa
Member

sxa commented Jun 30, 2022

You may be right that it's only a jdk11u problem, although a jdk19 smoke test has hung this morning.

To be 100% clear, that was a smoke test hang on x64, not aarch64, so not relevant to this PR, and from the look of it, it's not new, as it has hit the timeout in all runs visible on https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk19/job/jdk19-alpine-linux-x64-temurin_SmokeTests/

@andrew-m-leonard
Contributor Author

You may be right that it's only a jdk11u problem, although a jdk19 smoke test has hung this morning.

To be 100% clear, that was a smoke test hang on x64, not aarch64, so not relevant to this PR.

Not sure we can totally rule that out, as both are Alpine.

@eclipse-temurin-bot
Collaborator

 PR TESTER RESULT 

❎ Some pipelines failed or the job was aborted! ❎
See the pipeline-build-check below for more information...

@sxa
Member

sxa commented Jun 30, 2022

Not sure we can totally rule that out, as both are Alpine.

Not totally, but JDK19 on x64 has had a 100% hang rate recently, aarch64 has not failed recently.

@sophia-guo
Contributor

You may be right that it's only a jdk11u problem, although a jdk19 smoke test has hung this morning. I thought @sophia-guo had seen the above jdk17u hang during triage?

Yes, I think it is the same for jdk17: test jobs time out and are aborted.
Test_openjdk17_hs_extended.openjdk_aarch64_alpine-linux ❌ ABORTED ❌

Test_openjdk17_hs_extended.openjdk_aarch64_alpine-linux_testList_0 ❌ ABORTED ❌
jdk_security3_2 => deep history 0/1 passed | possible issues |

Test_openjdk17_hs_extended.openjdk_aarch64_alpine-linux_testList_1 ❌ ABORTED ❌
jdk_security4_0 => deep history 0/1 passed | possible issues |

@sxa
Member

sxa commented Jul 1, 2022

You may be right that it's only a jdk11u problem, although a jdk19 smoke test has hung this morning. I thought @sophia-guo had seen the above jdk17u hang during triage?

Yes, I think it is the same for jdk17: test jobs time out and are aborted.
Test_openjdk17_hs_extended.openjdk_aarch64_alpine-linux ❌ ABORTED ❌

Test_openjdk17_hs_extended.openjdk_aarch64_alpine-linux_testList_0 ❌ ABORTED ❌
jdk_security3_2 => deep history 0/1 passed | possible issues |

Test_openjdk17_hs_extended.openjdk_aarch64_alpine-linux_testList_1 ❌ ABORTED ❌
jdk_security4_0 => deep history 0/1 passed | possible issues |

@sophia-guo Why do you say they timed out? The first of those was the top-level job that had the two testList ones underneath it. The first, testList_0, was stopped by Andrew after 4h03m when it typically takes between 4 and 5 hours, so it did not time out but was stopped before it reached the time it would normally take to complete:

image

15:40:30  XML output with verification to /home/jenkins/workspace/Test_openjdk17_hs_extended.openjdk_aarch64_alpine-linux_testList_0/aqa-tests/TKG/output_16563194353924/jdk_security3_2/work
Aborted by [Andrew Leonard](https://ci.adoptopenjdk.net/user/andrew-m-leonard)
15:46:27  Sending interrupt signal to process
15:46:31  Terminated

The testList_1 job was the same: killed earlier than the amount of time it would normally take to run.

image

15:45:29  XML output with verification to /home/jenkins/workspace/Test_openjdk17_hs_extended.openjdk_aarch64_alpine-linux_testList_1/aqa-tests/TKG/output_16563193343027/jdk_security4_0/work
Aborted by [Andrew Leonard](https://ci.adoptopenjdk.net/user/andrew-m-leonard)
15:46:18  Sending interrupt signal to process
15:46:20  Terminated

As a result of those two being aborted, the top level one was marked as aborted too.
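
A minimal sketch of the distinction drawn above, assuming the console log contains an "Aborted by ..." line for manual aborts, as in the two excerpts. The helper names `manual_abort_user` and `within_typical_duration` are hypothetical and not part of any Jenkins or Adoptium tooling; the durations are the ones quoted in this comment.

```python
import re

# Matches a console line such as "Aborted by Andrew Leonard" or the
# Markdown-link form "Aborted by [Andrew Leonard](https://...)".
ABORTED_BY = re.compile(r"^Aborted by \[?([^\]\n]+?)\]?(?:\(.*\))?\s*$")


def manual_abort_user(console_log):
    """Return the user named on an 'Aborted by' line, or None if there is none."""
    for line in console_log.splitlines():
        match = ABORTED_BY.match(line.strip())
        if match:
            return match.group(1).strip()
    return None


def within_typical_duration(elapsed_minutes, typical_max_minutes):
    """True if the run had not yet exceeded the upper end of its typical range."""
    return elapsed_minutes <= typical_max_minutes


excerpt = """Aborted by [Andrew Leonard](https://ci.adoptopenjdk.net/user/andrew-m-leonard)
15:46:27  Sending interrupt signal to process
15:46:31  Terminated"""

user = manual_abort_user(excerpt)
# 4h03m elapsed against a typical upper bound of 5 hours.
in_range = within_typical_duration(4 * 60 + 3, 5 * 60)
print(user, in_range)
```

A run that was aborted by a user while still inside its typical duration range says nothing either way about whether it would have hung.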

As I said earlier, I haven't seen any evidence that JDK17/alpine/aarch64 has experienced any unusual hangs, only forced aborts.

https://ci.adoptopenjdk.net/job/Test_openjdk17_hs_extended.openjdk_aarch64_alpine-linux_testList_1/1/ may have hit a timeout, but (a) there were a lot of hung Grinder processes on the machine at that time, which can be seen in the log, so I wouldn't count that run as conclusive evidence, and (b) it seems to have been trying to load the AWT libraries, so it wasn't running in headless mode, which could have caused additional problems.

My recommendation remains that if we're going to change this, we should be doing it for JDK11u/aarch64 ONLY.

@github-actions github-actions bot added testing and removed testing labels Jul 1, 2022
@andrew-m-leonard andrew-m-leonard changed the title from "Remove aarch64Alpine from default pipeline config due to test hangs reliability" to "Remove aarch64Alpine from default jdk11u pipeline config due to test hangs reliability" Jul 1, 2022
@andrew-m-leonard
Contributor Author

We believe this is just a jdk11u problem, so updated PR to only remove from jdk11u.

@andrew-m-leonard
Contributor Author

run tests

@eclipse-temurin-bot
Collaborator

 PR TESTER RESULT 

❎ Some pipelines failed or the job was aborted! ❎
See the pipeline-build-check below for more information...

@andrew-m-leonard
Contributor Author

run tests

@eclipse-temurin-bot
Collaborator

 PR TESTER RESULT 

❎ Some pipelines failed or the job was aborted! ❎
See the pipeline-build-check below for more information...

@andrew-m-leonard
Contributor Author

run tests

@github-actions github-actions bot added testing and removed testing labels Jul 1, 2022
@eclipse-temurin-bot
Collaborator

 PR TESTER RESULT 

❎ Some pipelines failed or the job was aborted! ❎
See the pipeline-build-check below for more information...

@andrew-m-leonard
Contributor Author

run tests

Member

@sxa sxa left a comment

I'm +1 on this change after restricting to JDK11u/aarch64.
We have adoptium/temurin-build#2961 to cover the development of a fix, after which we can re-enable.

@eclipse-temurin-bot
Collaborator

 PR TESTER RESULT 

❎ Some pipelines failed or the job was aborted! ❎
See the pipeline-build-check below for more information...

@andrew-m-leonard andrew-m-leonard merged commit 0d9963b into adoptium:master Jul 1, 2022
5 participants