
Agents going into drain because unwanted disk partitions are above threshold #12139

Closed
amaltaro opened this issue Oct 9, 2024 · 5 comments · Fixed by #12140
Comments

@amaltaro
Contributor

amaltaro commented Oct 9, 2024

Impact of the bug
WMAgent

Describe the bug
We have a couple of agents that were automatically set to drain mode because one of their disk partitions is above the threshold set in the agent (set to 85% utilization at the moment).

How to reproduce it
Just fill up any partition on the node above 85% utilization.

Expected behavior
We need to either make it possible to disable specific partitions in the configuration, so that the agent stops monitoring and acting on them, or create an allowed list of partitions to be monitored (the latter would be a larger change, though).
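For illustration, a configuration hook along these lines could look like the sketch below (the component and attribute names are hypothetical, invented for this example; they are not existing WMAgent configuration options):

# Hypothetical WMAgent configuration sketch -- names invented for illustration,
# not existing configuration attributes.
config.AgentStatusWatcher.ignoreDisks = ["/etc/group", "/data/certs"]   # option 1: exclude specific mount points
config.AgentStatusWatcher.monitorDisks = ["/data", "/tmp"]              # option 2 (larger change): explicit allow-list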

Additional context and error message
Examples:

agent: vocms0281(2.3.4.4)
disk warning:
    /etc/group:85%

and

agent: vocms0283(2.3.4.4)
disk warning:
    /data/certs:85%
@hassan11196
Member

It seems that the /data mount on these machines has reached 85%. However, the df -klP command inside the wmagent containers reports it as /etc/group on vocms0281 and /data/certs on vocms0283.

@hassan11196
Member

hassan11196 commented Oct 9, 2024

I looked at the container mounts and noticed that, inside the container, df -klP reports the mount Destination that is out of space instead of the Source.

For example, looking at the Mounts section of docker inspect wmagent on vocms0283, /etc/group is mounted from /data/dockerMount/admin/etc/group (a small helper to print this mapping is sketched after the JSON below).

 "Mounts": [
            {
                "Type": "bind",
                "Source": "/tmp",
                "Destination": "/tmp",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/data/dockerMount/admin/wmagent",
                "Destination": "/data/admin/wmagent",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/data/dockerMount/admin/etc/group",
                "Destination": "/etc/group",
                "Mode": "",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/etc/tnsnames.ora",
                "Destination": "/etc/tnsnames.ora",
                "Mode": "",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/data/certs",
                "Destination": "/data/certs",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/etc/sudoers",
                "Destination": "/etc/sudoers",
                "Mode": "",
                "RW": false,
                "Propagation": "rprivate"
            },

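A quick way to print the full source -> destination mapping is sketched below (illustrative helper, not part of WMCore; it assumes docker is available and the container is named wmagent, as above):

import json
import subprocess

# List the bind mounts of the wmagent container as "source -> destination".
# Illustrative helper only; assumes the container is named "wmagent".
inspect = subprocess.run(["docker", "inspect", "wmagent"],
                         capture_output=True, check=True, text=True)
for mount in json.loads(inspect.stdout)[0]["Mounts"]:
    print(f"{mount['Source']} -> {mount['Destination']} (RW={mount['RW']})")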
@mapellidario
Member

why do we have a problem

A couple of additional pointers (maybe these are known to everybody, but future dario will thank me):

data in wmstats is displayed at the following line only if agentInfo.disk_warning is not null

formatStr += "<li><b>" + diskList[i].mounted +"</b>:" + diskList[i].percent +"</li>";

that information comes from AgentStatusPoller

agentInfo['disk_warning'] = listDiskUsageOverThreshold(self.config, updateDB=True)

that uses df -klP (thanks Hassan for spotting it!), parsing only the last two columns

df = subprocess.Popen(["df", "-klP"], stdout=subprocess.PIPE)

As Hassan described, df -klP from inside the docker container reports the path of the directory from the container's point of view.

Since we bind mount into the docker container multiple directories that on the VM host live under /data, when df runs inside the container it picks one of the "destination" mount points referring to the host filesystem/partition it is reporting on. For example, notice how df reports disk usage on the host [0] (filesystem /dev/vdb, mount point /data), while inside the container the mount point is not consistent across different wmagents [1] and [2].
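For reference, a minimal sketch of what that parsing boils down to (simplified; the real listDiskUsageOverThreshold in Utilities.py also deals with the configuration and database updates):

import subprocess

def list_disk_usage_over_threshold(threshold=85):
    """Simplified sketch of the df parsing; not the actual WMCore code."""
    df = subprocess.Popen(["df", "-klP"], stdout=subprocess.PIPE)
    output = df.communicate()[0].decode("utf-8").splitlines()
    over = []
    for line in output[1:]:                    # skip the header line
        split = line.split()
        if len(split) < 6:
            continue
        percent = int(split[4].rstrip("%"))    # "Capacity" column, e.g. "85%"
        if percent >= threshold:
            # split[5] is "Mounted on": inside the container this is the
            # bind-mount destination (/etc/group, /data/certs, ...)
            over.append({'mounted': split[5], 'percent': split[4]})
    return over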

what can we do to solve it

So, this is an unexpected regression from deploying WMCore with docker containers.

The cleanest solution that comes to my mind is to always identify a partition by the "Filesystem" column and not by "Mounted on", changing in Utilities.py

- diskPercent.append({'mounted': split[5], 'percent': split[4]})
+ diskPercent.append({'mounted': split[0], 'percent': split[4]})

so that we consistently report the /data partition as /dev/vdb.
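Applied to the sketch above, the change amounts to keying each entry on the Filesystem column; a deduplication step is also sketched in case the same device shows up on more than one line (names are illustrative, not the actual Utilities.py code):

def list_disk_usage_by_filesystem(threshold=85):
    """Illustrative sketch of the proposed change; not the actual WMCore code."""
    df = subprocess.Popen(["df", "-klP"], stdout=subprocess.PIPE)
    output = df.communicate()[0].decode("utf-8").splitlines()
    seen, over = set(), []
    for line in output[1:]:
        split = line.split()
        if len(split) < 6 or split[0] in seen:
            continue
        seen.add(split[0])                     # one entry per device, e.g. /dev/vdb
        percent = int(split[4].rstrip("%"))
        if percent >= threshold:
            over.append({'mounted': split[0], 'percent': split[4]})
    return over

With this change, vocms0281 would report /dev/vdb:85% both on the host and inside the container, instead of /etc/group:85%.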

Otherwise, the internet suggests bind mounting the root directory of the host read-only [3], but I really dislike this approach.

In any case, I would like to hear a quick opinion from @todor-ivanov, whom I trust with all these docker shenanigans :)


anatomy of df -klP

[0]

(on the host)

cmst1@vocms0281:dmapelli $ df -klP
Filesystem     1024-blocks      Used Available Capacity Mounted on
/dev/vda1        334914540  14111860 320802680       5% /
/dev/vda15          556948      7228    549720       2% /boot/efi
tmpfs              8388608   1069908   7318700      13% /mnt/ramdisk
/dev/vdb        1031911884 827451044 154616384      85% /data

[1]

(inside the container, vocms0281)

(WMAgent-2.3.4.4) [cmst1@vocms0281:current]$ df -klP
Filesystem     1024-blocks      Used Available Capacity Mounted on
overlay          334914540  14111860 320802680       5% /
/dev/vda1        334914540  14111860 320802680       5% /tmp
/dev/vdb        1031911884 827448800 154618628      85% /etc/group

[2]

(inside the container, vocms0256)

(WMAgent-2.3.4.4) [cmst1@vocms0256:current]$ df -klP
Filesystem     1024-blocks      Used Available Capacity Mounted on
overlay          334914540  11805236 323109304       4% /
/dev/vda1        334914540  11805236 323109304       4% /tmp
/dev/vdb        1031911884 772601688 209465740      79% /data/certs

[3] netdata/netdata#16444 (comment)

@todor-ivanov
Contributor

todor-ivanov commented Oct 9, 2024

Hi @mapellidario, your explanation is complete and good.

About:

Otherwise, the internet suggests bind mounting the root directory of the host read-only [3], but I really dislike this approach.

I definitely would not want to add yet another item to my list of things to regret before I meet the Reaper.

About this though:

The cleanest solution that comes to my mind is to always identify a partition by the "Filesystem" column and not by "Mounted on", changing in Utilities.py

Your suggestion is one way to go, but in that case people will only ever see the volume reported as reaching the threshold, which is indeed true and fair. Another way to go would be simply to visualize all mount points, including the duplicate mounts:

in

df = subprocess.Popen(["df", "-klP"], stdout=subprocess.PIPE)

the following change should do the job:

- df = subprocess.Popen(["df", "-klP"], stdout=subprocess.PIPE) 
+ df = subprocess.Popen(["df", "-klPa"], stdout=subprocess.PIPE)

The choice should be up to whoever takes the issue on. Both will do.

@amaltaro
Contributor Author

amaltaro commented Oct 9, 2024

@mapellidario can you please take this issue on? No pressure intended, but if we can have this development by the end of this week, we can already consider it for the upcoming WMAgent release.
