Visible Nagios alerts for available infra not meeting Temurin minimum SLA? #3372
Comments
Moved into iteration 4, whilst we scope the requirements. |
Ref high CPU alerts, I think we can probably strip it down a bit. The only time I've seen a real problem was when the ppc64le boxes shot up to a level way in excess of what I would expect. I would tentatively propose that we set it to alert if the CPU usage is over 200% over 5 minutes (or perhaps over 90% for a 24-hour period, but not sure if that's feasible). |
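For illustration, a minimal sketch of how such a check could be expressed as a Nagios-style plugin, assuming we normalise the 5-minute load average by the CPU count; the 2.0/3.0 thresholds are placeholders for whatever values we eventually agree, not settled numbers:

```python
#!/usr/bin/env python3
# Sketch only: warn/critical when the 5-minute load average per CPU
# exceeds a threshold (2.0 ~= "200%"). Thresholds are illustrative.
import os
import sys

WARN = 2.0   # ~200% of one core, sustained over the 5-minute window
CRIT = 3.0   # hypothetical critical level

load1, load5, load15 = os.getloadavg()
per_cpu = load5 / os.cpu_count()

if per_cpu >= CRIT:
    print(f"CRITICAL - 5min load/core is {per_cpu:.2f}")
    sys.exit(2)   # Nagios CRITICAL
if per_cpu >= WARN:
    print(f"WARNING - 5min load/core is {per_cpu:.2f}")
    sys.exit(1)   # Nagios WARNING
print(f"OK - 5min load/core is {per_cpu:.2f}")
sys.exit(0)       # Nagios OK
```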
We'll also need to consider how useful the memory warnings are, and specifically whether they are something we need to take any action on - we're getting about one of those every 1-2 weeks:
|
We've got a lot of these on the macincloud machines, which is concerning and likely needs remediation if it's correct for the macOS file system. I'll raise an issue ... |
We don't currently have a specific SLA, so I'm not sure we can be considered as not meeting one. So we have two actions here (noting that we already have some rules in Nagios for checking percentages of available machines, total numbers etc., but this needs to be enhanced). Plan of action:
|
Current checks and thresholds in Nagios. Note that this is just the raw info, without much extra detail, to give an idea of the sort of things being checked, as input into this discussion.
|
I think the above covers it, but the specific infrastructure checks based on the number of agents online in Jenkins with certain labels are shown below. All of these count the total number of machines with the labels listed, and then warn and alert at various thresholds...
|
A complete list of checks across all the infra can be found on this link: A complete list of the current warnings and alerts can be found on this link: |
The base templates used by the automated configuration for nodes can be seen here. These can then be customised post-installation, should that be required. |
Thanks @steelhead31 this looks great. I'll review this, cheers |
@steelhead31 how often does Nagios "poll" for these thresholds? |
Initial review @steelhead31. Can be removed, as replaced by dynamic Orka:
Corrections?
Need?
|
@andrew-m-leonard To be clear, are your "corrections?" ones purely to change the labels on the existing checks, and is "need?" your proposal for things to add? |
Daily as it stands, though it's configurable... more often = more noise though :) |
yes please, correct |
One click overview of current issues: https://nagios.adoptopenjdk.net/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=16&hoststatustypes=15 |
I like this view @steelhead31, that could be my go-to place for critical alerts
|
To my mind:
is actually something we'd want to address, so should be flagged, since the machine is overworked and hitting issues as a result (we've spotted the issue here and it will be dealt with under #3375). But something like
isn't a real concern (in fact it's good that the jobs are making full use of the machines!) so is something we should certainly avoid posting to Slack. |
Adjust Jenkins warning and alert thresholds for jobs & workspace to warn at 95% and critical at 98% due to the large filesystems. |
@sxa & @andrew-m-leonard, I've rationalised the discussion above into an easy-to-follow list. I'll implement when I'm back on Monday. Sadly I'm fairly limited in what I can do as regards Unix machine CPU load, but I think my proposed change of running the check with a slightly different setup should hopefully be more useful, and we can always review and fix the thresholds again.
Windows: Warn if over 90% for 12 hours (720), Critical if over 90% for 18 hours (1080). Calculated by dividing the load average by the number of CPUs (1.0 is fully loaded).
Remove these checks:
2.01) Check: x64 Macos10.14 Build Machines
2.02) Check: Macos10.14 with xcode10 Machines

Correct these checks:
2.03) ci.role.test&&hw.arch.x86&&sw.os.windows: Warn: 75% Online, Alert: 30% Online
2.04) ci.role.test&&hw.arch.s390x&&sw.os.linux: Warn: 75% Online, Alert: 30% Online

Add these checks:
2.05) build&&linux&&aarch64&&dockerBuild: Warn: 51% Online, Alert: 30% Online
On the jobs & workspace partitions, warn at 95% and critical at 98% due to the large filesystems involved. |
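As a rough sketch of what those partition thresholds amount to (the 95/98 values are from the comment above; the path and the plugin itself are hypothetical examples, not the existing Nagios check):

```python
#!/usr/bin/env python3
# Sketch of a Nagios-style disk check: warn at 95% used, critical at 98%,
# matching the thresholds discussed above. The path is an example only.
import shutil
import sys

PATH = "/home/jenkins/workspace"   # hypothetical mount point
WARN, CRIT = 95.0, 98.0

usage = shutil.disk_usage(PATH)
pct_used = 100.0 * usage.used / usage.total

if pct_used >= CRIT:
    print(f"CRITICAL - {PATH} is {pct_used:.1f}% full")
    sys.exit(2)
if pct_used >= WARN:
    print(f"WARNING - {PATH} is {pct_used:.1f}% full")
    sys.exit(1)
print(f"OK - {PATH} is {pct_used:.1f}% full")
sys.exit(0)
```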
I honestly still think that'll be too noisy for non-dockerhost systems, since it'll trigger on every run of the system test suites (and probably more). If we're using load/cores then we should use higher numbers, e.g. warn if over 110% for 15 minutes, critical if over maybe 150% over either 5 or 15 minutes? I'm running a sanity.system on a 2-core system and the 1-minute load went above 2 just when building the material (up to about 2.5). During the actual test run at step 5 of TestJlmRemoteClassAuth_0 I was seeing numbers like [EDIT: For a 4-core system I'm seeing similar figures during the test:] However, for dockerhost systems, the values you suggest are likely a reasonable starting point. |
Update 1: |
Update 2: |
Update 3: |
Tweaked Solaris & SLES load thresholds to |
All work completed. New issues should be raised for future improvements. |
I'd like to see a visible alert if the number of infra nodes does not meet a defined SLA.
e.g. if the number of "online" ci.role.test&&aarch64 nodes is < 6, then a BIG alert on Slack that's not hidden by 100s of other Nagios alerts? |
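A rough sketch of what such a check might look like, assuming it queries the Jenkins JSON API for online agents carrying a set of labels and posts to a Slack incoming webhook when the count drops below the minimum. The Jenkins URL, webhook URL and the minimum of 6 are placeholders, not agreed values:

```python
#!/usr/bin/env python3
# Sketch: count online Jenkins agents carrying all required labels and
# raise a loud Slack alert if fewer than MINIMUM are available.
import json
import urllib.request

JENKINS_URL = "https://ci.adoptium.net"                     # assumed
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY"  # placeholder
REQUIRED_LABELS = {"ci.role.test", "aarch64"}
MINIMUM = 6

api = (f"{JENKINS_URL}/computer/api/json"
       "?tree=computer[displayName,offline,assignedLabels[name]]")
with urllib.request.urlopen(api) as resp:
    computers = json.load(resp)["computer"]

# Agents that are online and have every required label assigned
online = [
    c["displayName"]
    for c in computers
    if not c["offline"]
    and REQUIRED_LABELS <= {lbl["name"] for lbl in c["assignedLabels"]}
]

if len(online) < MINIMUM:
    msg = {"text": f":rotating_light: Only {len(online)}/{MINIMUM} "
                   f"{'&&'.join(sorted(REQUIRED_LABELS))} agents online"}
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(msg).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```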