
Visible Nagios alerts for available infra not meeting Temurin minimum SLA? #3372

Closed
andrew-m-leonard opened this issue Feb 7, 2024 · 26 comments

@andrew-m-leonard
Contributor

I'd like to see a visible alert if the number of infra nodes does not meet a defined SLA.
e.g. if the number of "online" ci.role.test&&aarch64 nodes is < 6, then raise a BIG alert on Slack that isn't hidden by 100s of other Nagios alerts.
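
A check along these lines could be implemented as a small Nagios plugin that queries the Jenkins computer API and exits with the usual Nagios return codes. The sketch below is illustrative only: the Jenkins base URL, the label set and the minimum of 6 online nodes are assumptions taken from this request, not the existing check_label implementation.

    #!/usr/bin/env python3
    # Minimal sketch: go CRITICAL if fewer than MIN_ONLINE Jenkins agents
    # carrying all of LABELS are online. Values below are illustrative.
    import json
    import sys
    import urllib.request

    JENKINS = "https://ci.adoptium.net"      # assumed Jenkins base URL
    LABELS = {"ci.role.test", "aarch64"}     # labels the node must carry
    MIN_ONLINE = 6                           # hypothetical SLA minimum

    def main():
        with urllib.request.urlopen(JENKINS + "/computer/api/json") as resp:
            computers = json.load(resp)["computer"]
        online = 0
        for node in computers:
            labels = {lbl["name"] for lbl in node.get("assignedLabels", [])}
            if LABELS.issubset(labels) and not node["offline"]:
                online += 1
        if online < MIN_ONLINE:
            print(f"CRITICAL - only {online}/{MIN_ONLINE} required nodes online")
            return 2                         # Nagios CRITICAL
        print(f"OK - {online} nodes online (minimum {MIN_ONLINE})")
        return 0                             # Nagios OK

    if __name__ == "__main__":
        sys.exit(main())

A check like this could then be routed to a dedicated Slack-only notification target so it doesn't get lost among the other alerts.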

@steelhead31
Contributor

Moved into iteration 4, whilst we scope the requirements.

@sxa
Member

sxa commented Feb 7, 2024

Ref high CPU alerts, I think we can probably strip them down a bit. The only time I've seen a real problem was when the ppc64le boxes shot up to a level way in excess of what I would expect. I would tentatively propose that we set it to alert if the CPU usage is over 200% over 5 minutes (or perhaps over 90% for a 24-hour period, but I'm not sure if that's feasible).

@sxa
Member

sxa commented Feb 7, 2024

We'll also need to consider how useful the memory warnings are, and specifically whether they are something we need to take any action on - we're getting about one of those every 1-2 weeks:

  • Nov 15: HOST: test-azure-win2012r2-x64-3 SERVICE: Memory Usage STATE: WARNING MESSAGE: Memory usage: total:6186.16 MB - used: 5449.47 MB (88%) - free: 736.69 MB (12%)
  • Nov 23: HOST: test-azure-win2012r2-x64-3 SERVICE: Memory Usage STATE: WARNING MESSAGE: Memory usage: total:6234.91 MB - used: 5459.21 MB (88%) - free: 775.70 MB (12%)
  • Dec 3: HOST: test-azure-win2012r2-x64-3 SERVICE: Memory Usage STATE: WARNING MESSAGE: Memory usage: total:5836.37 MB - used: 4786.06 MB (82%) - free: 1050.31 MB (18%)
  • Dec 5: HOST: test-azure-win2012r2-x64-3 SERVICE: Memory Usage STATE: WARNING MESSAGE: Memory usage: total:6416.96 MB - used: 5264.50 MB (82%) - free: 1152.46 MB (18%)
  • Dec 14 (Critical): HOST: build-ibmcloud-win2012r2-x64-2 SERVICE: Memory Usage STATE: CRITICAL MESSAGE: CRITICAL - Socket timeout
  • Jan 5 (Critical): HOST: build-azure-win2022-x64-2 SERVICE: Memory Usage STATE: CRITICAL MESSAGE: CRITICAL - Socket timeout
  • Feb 7: HOST: test-azure-win11-aarch64-1 SERVICE: Memory Usage STATE: WARNING MESSAGE: Memory usage: total:9695.75 MB - used: 7895.15 MB (81%) - free: 1800.60 MB (19%)

The critical ones are potentially not directly memory related (unless they had gone critical because of memory starvation). test-azure-win2012r2-x64-3, which had most of the memory issues, no longer exists, so we won't see those again. OK, I've convinced myself we can leave the memory warnings as-is :-)

@sxa
Member

sxa commented Feb 7, 2024

We've got a lot of these on the macincloud machines, which is concerning and likely needs remediation if it's correct for the macOS file system:
HOST: test-macincloud-macos1201-x64-1 SERVICE: Disk Space Root Partition STATE: WARNING MESSAGE: DISK WARNING - free space: / 16525 MiB (13.47% inode=100%)
HOST: test-macincloud-macos1201-x64-2 SERVICE: Disk Space Root Partition STATE: WARNING MESSAGE: DISK WARNING - free space: / 23837 MiB (19.43% inode=100%)

I'll raise an issue ...

@sxa
Member

sxa commented Feb 7, 2024

We don't currently have a specific SLA, so I'm not sure we can be considered as not meeting one. (Note that we already have some rules in Nagios for checking percentages of available machines, total numbers, etc., but these need to be enhanced.) Plan of action:

  • Define the checks we want (and document them!)
  • Implement the checks
  • Confirm that everyone who needs the information is happy that #infrastructure-bot is producing useful information

@sxa
Member

sxa commented Feb 7, 2024

Current checks and thresholds in Nagios. Note that this is just the raw info, without much extra detail, to give an idea of the sort of things being checked, as input into this discussion.

        check_command                   check_local_disk!20%!10%!/
        check_command                   check_local_users!10!15
        check_command                   check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
        check_command                   check_local_swap!20!10
        check_command                   check_ssh
        check_command                   check_http
        check_command                   check_local_mem!15!5
        check_command                   check_local_apt
#       check_command                   check_label!arm&&build!80!66
#       check_command                   check_label!arm&&ci.role.test!30!10
        check_command                   check_label!hw.arch.aarch32&&ci.role.test&&sw.os.linux!30!10
#       check_command                   check_label!ubuntu&&fpm!99!98
        check_command                   check_label!build&&linux&&s390x!75!30
        check_command                   check_label!test&&linux&&s390x!75!30
        check_command                   check_label!build&&windows&&x64!75!30
        check_command                   check_label!test&&windows&&x64!75!30
        check_command                   check_label!build&&openj9&&linux&&s390x!75!30
        check_command                   check_label!macos10.14&&build&&mac&&x64!75!30
        check_command                   check_label!mac&&macos10.14&&xcode10!65!33
        check_command                   check_label!ci.role.test&&sw.os.linux&&hw.arch.riscv!65!33
        check_command                   check_label!ci.role.test&&sw.os.alpine-linux&&hw.arch.aarch64!65!33
        check_command                   check_inventory
        check_command                   check_nagios_sync
        check_command                   check_label!wix!75!30
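
For anyone less familiar with the Nagios side: each check_command above is invoked from a service definition, and for the check_label entries the two numeric arguments line up with the "Warn % online" / "Alert % online" thresholds listed further down this issue. A rough sketch of how one of them might be wired up is below; the command_line, plugin path and host_name are assumptions for illustration, not the exact contents of our config.

    # Illustrative only
    define command {
        command_name  check_label
        # $ARG1$ = Jenkins label expression, $ARG2$ = warn % online, $ARG3$ = critical % online
        command_line  $USER1$/check_label '$ARG1$' $ARG2$ $ARG3$
    }

    define service {
        use                  generic-service
        host_name            nagios-server        ; placeholder - wherever the Jenkins query runs from
        service_description  Jenkins agents online: test&&linux&&s390x
        check_command        check_label!test&&linux&&s390x!75!30
    }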

@steelhead31
Contributor

I think the above covers it, but the specific infrastructure checks based on the number of agents online in Jenkins with certain labels are shown below:

All of these count the total number of machines with the labels listed, and then warn and alert at various thresholds...

1) Check: Arm32 Linux Test Machines
   Labels: ci.role.test & sw.os.linux & hw.arch.aarch32
   Warn : 30% Online, Alert : 10% Online

2) Check: s390x Linux Build Machines
   Labels: build & linux & s390x
   Warn : 75% Online, Alert : 30% Online

3) Check: s390x Linux Test Machines
   Labels: test & linux & s390x
   Warn : 75% Online, Alert : 30% Online

4) Check: x64 Windows Build Machines
   Labels: build & windows & x64
   Warn : 75% Online, Alert : 30% Online

5) Check: x64 Windows Test Machines
   Labels: test & windows & x64
   Warn : 75% Online, Alert : 30% Online

6) Check: s390x build openj9 linux Machines
   Labels: build & linux & s390x & openj9
   Warn : 75% Online, Alert : 30% Online

7) Check: x64 Macos10.14 Build Machines
   Labels: macos10.14 & build & mac & x64
   Warn : 75% Online, Alert : 30% Online

8) Check: Macos10.14 with xcode10 Machines
   Labels: macos10.14 & mac & xcode10
   Warn : 65% Online, Alert : 33% Online

9) Check: Risc-V 64 test Machines
   Labels: ci.role.test & sw.os.linux & hw.arch.riscv
   Warn : 65% Online, Alert : 33% Online

10) Check: Alpine ARM64 Test Machines
    Labels: ci.role.test & sw.os.alpine-linux & hw.arch.aarch64
    Warn : 65% Online, Alert : 33% Online

11) Check: Wix Machines
    Labels: wix
    Warn : 75% Online, Alert : 30% Online

@steelhead31
Contributor

A complete list of checks across all the infra can be found at this link:
https://nagios.adoptopenjdk.net/nagios/cgi-bin/status.cgi?host=all

A complete list of the current warnings and alerts can be found at this link:
https://nagios.adoptopenjdk.net/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28

@steelhead31
Contributor

The base templates used by the automated configuration for nodes can be seen here:

https://github.com/adoptium/infrastructure/tree/master/ansible/playbooks/nagios/roles/Nagios_Config/files/templates

These can then be customised post installation, should that be required.

@andrew-m-leonard
Contributor Author

I think the above covers it, but the specific infrastructure checks based on the number of agents online in Jenkins with certain labels are shown below: […]

Thanks @steelhead31, this looks great. I'll review this, cheers.

@andrew-m-leonard
Contributor Author

@steelhead31 how often does Nagios "poll" for these thresholds?

@andrew-m-leonard
Contributor Author

Initial review @steelhead31

These can be removed, as they have been replaced by dynamic Orka machines:

7) Check: x64 Macos10.14 Build Machines
   Labels: macos10.14 & build & mac & x64
   Warn : 75% Online, Alert : 30% Online

8) Check: Macos10.14 with xcode10 Machines
   Labels: macos10.14 & mac & xcode10
   Warn : 65% Online, Alert : 33% Online

Corrections?

ci.role.test&&hw.arch.x86&&sw.os.windows:  Warn : 75% Online, Alert : 30% Online

ci.role.test&&hw.arch.s390x&&sw.os.linux: Warn : 75% Online, Alert : 30% Online

Need?

build&&linux&&aarch64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
ci.role.test&&sw.os.linux&&hw.arch.aarch64: Warn : 75% Online, Alert : 30% Online

build&&linux&&x64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
ci.role.test&&hw.arch.x86&&sw.os.linux: Warn : 75% Online, Alert : 30% Online

build&&alpine-linux&&x64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
ci.role.test&&hw.arch.x86&&sw.os.alpine-linux: Warn : 75% Online, Alert : 30% Online

build&&alpine-linux&&aarch64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
ci.role.test&&hw.arch.aarch64&&sw.os.alpine-linux: Warn : 75% Online, Alert : 30% Online

aix720&&build&&aix&&ppc64: Warn : 51% Online, Alert : 30% Online
ci.role.test&&hw.arch.ppc64&&sw.os.aix&&sw.os.aix.7_2: Warn : 75% Online, Alert : 30% Online

build&&linux&&ppc64le&&dockerBuild: Warn : 51% Online, Alert : 30% Online
ci.role.test&&hw.arch.ppc64le&&sw.os.linux: Warn : 75% Online, Alert : 30% Online

build&&windows&&x86-32: Warn : 51% Online, Alert : 30% Online
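
If these are agreed, presumably they translate into check_label entries in the same format as the existing config shown earlier, e.g. (illustrative only, not yet implemented):

        check_command                   check_label!build&&linux&&aarch64&&dockerBuild!51!30
        check_command                   check_label!ci.role.test&&sw.os.linux&&hw.arch.aarch64!75!30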

@sxa
Member

sxa commented Feb 8, 2024

@andrew-m-leonard To be clear, are your "corrections?" ones purely to change the labels on the existing checks, and is "need?" your proposal for things to add?

@steelhead31
Contributor

@steelhead31 how often does Nagios "poll" for these thresholds?

Daily as it stands, though it's configurable... more often = more noise though :)
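
For context, a Nagios service's polling frequency is set by check_interval (in interval_length units, normally minutes) in its service definition, so a daily label check would look roughly like the sketch below. The directives are standard Nagios ones, but the values and service name here are assumptions rather than what is actually deployed.

    define service {
        use                  generic-service
        service_description  Jenkins agents online (example)
        check_command        check_label!test&&linux&&s390x!75!30
        check_interval       1440    ; once per day, assuming interval_length of 60s
        retry_interval       60      ; assumed spacing of rechecks while non-OK
    }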

@andrew-m-leonard
Contributor Author

@andrew-m-leonard To be clear are your "corrections?" ones purely to change the labels on the existing checks and "need?" is your proposal for things to add?

Yes, that's correct.

@steelhead31
Contributor

@andrew-m-leonard
Contributor Author

One-click overview of current issues: https://nagios.adoptopenjdk.net/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=16&hoststatustypes=15

I like this view @steelhead31; that could be my go-to place for critical alerts.
If we could perhaps:

@sxa
Member

sxa commented Feb 9, 2024

* Make CurrentLoad and CPULoad alerts more likely a Warning, maybe only Critical if, say, 99% for 6 hours?

To my mind:

CRITICAL - load average: 212.13, 161.58, 167.15 

is actually something we'd want to address, so it should be flagged, since the machine is overworked and hitting issues as a result (we've spotted the issue here and it will be dealt with under #3375).

But something like

CPU Load 99% (5 min average) 

isn't a real concern (in fact it's good that the jobs are making full use of the machines!), so it is something we should certainly avoid posting to Slack.
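
The distinction matters because the two checks measure different things: CPU utilisation tops out at 100%, while the load average counts runnable (and uninterruptible) tasks and can climb far above the core count when work is queueing. A minimal sketch of the per-core normalisation later proposed for the Unix checks, assuming a Linux/Unix host:

    #!/usr/bin/env python3
    # Sketch: read the 1/5/15-minute load averages and normalise by core count,
    # i.e. the "load average divided by number of CPUs" figure discussed below.
    import os

    def per_core_load():
        load1, load5, load15 = os.getloadavg()   # e.g. (212.13, 161.58, 167.15)
        cores = os.cpu_count() or 1
        return tuple(round(l / cores, 2) for l in (load1, load5, load15))

    if __name__ == "__main__":
        print("per-core load (1/5/15 min):", per_core_load())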

@steelhead31
Contributor

Adjust the Jenkins warning and alert thresholds for the jobs & workspace partitions to warn at 95% and critical at 98%, due to the large filesystems.
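
Since check_local_disk takes free-space thresholds (as in the existing check_local_disk!20%!10%!/ entry), warning at 95% used and going critical at 98% used should correspond to something like the line below; the partition path is a placeholder, not the real mount point on the Jenkins server:

        check_command                   check_local_disk!5%!2%!/path/to/jenkins/workspace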

@steelhead31
Contributor

@sxa & @andrew-m-leonard, I've rationalised the discussion above into an easy-to-follow list. I'll implement it when I'm back on Monday. Sadly I'm fairly limited in what I can do as regards Unix machine CPU load, but I think my proposed change of running the check with a slightly different setup should hopefully be more useful; we can always review and adjust the thresholds again.

  1. Machine Load

Windows: Warn if over 90% for 12 hours (720), Critical if over 90% for 18 hours (108)
Unix: Warn if the 1, 5 & 15 minute averages are above 95%, 90%, 85% across all cores;
Critical if the 1, 5 & 15 minute averages are above 100%, 99%, 90%

Calculated by dividing the load average by the number of CPUs (1.0 is fully loaded)

  2. Label Checks (Runs On Nagios Server Against Jenkins)

Remove These Checks:

2.01) Check: x64 Macos10.14 Build Machines
Labels: macos10.14 & build & mac & x64
Warn : 75% Online, Alert : 30% Online

2.02) Check: Macos10.14 with xcode10 Machines
Labels: macos10.14 & mac & xcode10
Warn : 65% Online, Alert : 33% Online

Correct These Checks

2.03) ci.role.test&&hw.arch.x86&&sw.os.windows: Warn : 75% Online, Alert : 30% Online

2.04) ci.role.test&&hw.arch.s390x&&sw.os.linux: Warn : 75% Online, Alert : 30% Online

Add These Checks

2.05) build&&linux&&aarch64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
2.06) ci.role.test&&sw.os.linux&&hw.arch.aarch64: Warn : 75% Online, Alert : 30% Online
2.07) build&&linux&&x64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
2.08) ci.role.test&&hw.arch.x86&&sw.os.linux: Warn : 75% Online, Alert : 30% Online
2.09) build&&alpine-linux&&x64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
2.10) ci.role.test&&hw.arch.x86&&sw.os.alpine-linux: Warn : 75% Online, Alert : 30% Online
2.11) build&&alpine-linux&&aarch64&&dockerBuild: Warn : 51% Online, Alert : 30% Online
2.12) ci.role.test&&hw.arch.aarch64&&sw.os.alpine-linux: Warn : 75% Online, Alert : 30% Online
2.13) aix720&&build&&aix&&ppc64: Warn : 51% Online, Alert : 30% Online
2.14) ci.role.test&&hw.arch.ppc64&&sw.os.aix&&sw.os.aix.7_2: Warn : 75% Online, Alert : 30% Online
2.15) build&&linux&&ppc64le&&dockerBuild: Warn : 51% Online, Alert : 30% Online
2.16) ci.role.test&&hw.arch.ppc64le&&sw.os.linux: Warn : 75% Online, Alert : 30% Online
2.17) build&&windows&&x86-32: Warn : 51% Online, Alert : 30% Online

  3. Adjust The Jenkins Server Disk Space Thresholds

On the jobs & workspace partitions, warn at 95% and critical at 98%, due to the large filesystems involved.

@sxa
Member

sxa commented Feb 15, 2024

Unix : Warn if 1,5 & 15 minute averages are above = 95%, 90%, 85% across all cores
Critical if 1,5, & 15 minute averages are above = 100%, 99%, 90%

Calculated by dividing the load average by number of CPUs ( 1.0 is fully loaded )

I honestly still think that'll be too noisy for non-dockerhost systems, since it'll trigger on every run of the system test suites (and probably more). If we're using load/cores then we should use higher numbers, e.g. warn if over 110% for 15 minutes, critical if over maybe 150% over either 5 or 15 minutes? I'm running a sanity.system on a 2-core system and the 1-minute load went above 2 just when building the material (up to about 2.5). During the actual test run, at step 5 of TestJlmRemoteClassAuth_0, I was seeing numbers like 19.79, 9.87, 4.04, so that would trigger a critical alert regardless of what we use on a 2-core system (although the individual CPUs are showing only around 50% usage just now).

[EDIT: For a 4-core system I'm seeing similar figures during the test: 21.14, 13.83, 6.25, so that would still blow a "150% over 15 minutes" check]
[EDIT 2: That job, a bit further down the line, got to a load reading of 21.38, 18.99, 12.09. I think for the non-dockerhost systems we need to start by disabling the posting of the CPU alerts to Slack (or disabling them completely if there's no way of generating the alert in the UI without posting to Slack).]

However for dockerhost systems, the values you suggest are likely a reasonable starting point.

@steelhead31
Contributor

Update 1:
Machine load parameters have all been adjusted.
For Windows machines: Warn if over 90% for 12 hours (720), Critical if over 90% for 18 hours (108).
For Unix/dockerhost machines: for the ENTIRE machine (all cores), the 1, 5 & 15 minute average thresholds are:
warn = 95%, 90%, 85%
critical = 100%, 99%, 90%
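
For reference, with the standard monitoring-plugins check_load the per-core behaviour comes from its -r (--percpu) flag, which divides the load averages by the number of CPUs. Assuming the percentages above map onto per-core load values (95% = 0.95 and so on), the invocation would look roughly like this sketch rather than being the exact deployed command:

    check_load -r -w 0.95,0.90,0.85 -c 1.00,0.99,0.90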

@steelhead31
Contributor

Update 2:
Checks have been removed/modified and added as per request, now visible in Nagios.

@steelhead31
Contributor

Update 3:
Jenkins disk space thresholds have been adjusted.

@steelhead31
Contributor

Tweaked Solaris & SLES load thresholds to:
check_load -r -w 3.00, 2.5, 1.50 -c 4.00, 3.00, 2.00

steelhead31 moved this from Todo to In Progress in 2024 1Q Adoptium Plan on Feb 19, 2024
@steelhead31
Contributor

All works completed. New issues should be raised for future improvement.

github-project-automation bot moved this from In Progress to Done in 2024 1Q Adoptium Plan on Feb 19, 2024
sxa added this to the 2024-02 (February) milestone on Feb 20, 2024
sxa added the Nagios (Nagios monitoring issues) label on Feb 20, 2024