
Agents going into drain because unwanted disk partitions are above threshold #12139

Closed
amaltaro opened this issue Oct 9, 2024 · 5 comments · Fixed by #12140
Comments

@amaltaro
Contributor

amaltaro commented Oct 9, 2024

Impact of the bug
WMAgent

Describe the bug
We have a couple of agents that were automatically set to drain mode because one of their disk partitions is above the threshold set in the agent (set to 85% utilization at the moment).

How to reproduce it
Just fill up any partition on the node above 85% utilization.

Expected behavior
We need to either make it possible to disable specific partitions in the configuration, so that the agent stops monitoring and acting on them, or create an allowed list of partitions to be monitored (the latter would be a larger change, though).
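For illustration, a configuration hook along these lines could look like the sketch below (the component and attribute names are hypothetical, invented for this example; they are not existing WMAgent configuration options):

# Hypothetical WMAgent configuration sketch -- names invented for illustration,
# not existing configuration attributes.
config.AgentStatusWatcher.ignoreDisks = ["/etc/group", "/data/certs"]   # option 1: exclude specific mount points
config.AgentStatusWatcher.monitorDisks = ["/data", "/tmp"]              # option 2 (larger change): explicit allow-list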

Additional context and error message
Examples:

agent: vocms0281(2.3.4.4)
disk warning:
    /etc/group:85%

and

agent: vocms0283(2.3.4.4)
disk warning:
    /data/certs:85%
@hassan11196
Member

It seems that the /data mount on these machines has reached 85%. However, the df -klP command inside the wmagent containers reports it as /etc/group on vocms0281 and /data/certs on vocms0283.

@hassan11196
Member

hassan11196 commented Oct 9, 2024

I looked at the container mounts and noticed that, inside the container, df -klP reports the mount Destination that is out of space instead of the Source.

For example, looking at the Mounts section of docker inspect wmagent on vocms0283, /etc/group is mounted from /data/dockerMount/admin/etc/group (a small helper to print this mapping is sketched after the JSON below).

 "Mounts": [
            {
                "Type": "bind",
                "Source": "/tmp",
                "Destination": "/tmp",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/data/dockerMount/admin/wmagent",
                "Destination": "/data/admin/wmagent",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/data/dockerMount/admin/etc/group",
                "Destination": "/etc/group",
                "Mode": "",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/etc/tnsnames.ora",
                "Destination": "/etc/tnsnames.ora",
                "Mode": "",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/data/certs",
                "Destination": "/data/certs",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/etc/sudoers",
                "Destination": "/etc/sudoers",
                "Mode": "",
                "RW": false,
                "Propagation": "rprivate"
            },

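A quick way to print the full source -> destination mapping is sketched below (illustrative helper, not part of WMCore; it assumes docker is available and the container is named wmagent, as above):

import json
import subprocess

# List the bind mounts of the wmagent container as "source -> destination".
# Illustrative helper only; assumes the container is named "wmagent".
inspect = subprocess.run(["docker", "inspect", "wmagent"],
                         capture_output=True, check=True, text=True)
for mount in json.loads(inspect.stdout)[0]["Mounts"]:
    print(f"{mount['Source']} -> {mount['Destination']} (RW={mount['RW']})")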
@mapellidario
Member

why do we have a problem

A couple of additional pointers (maybe these are known to everybody, but future dario will thank me):

data in wmstats is displayed at the following line only if agentInfo.disk_warning is not null

formatStr += "<li><b>" + diskList[i].mounted +"</b>:" + diskList[i].percent +"</li>";

that information comes from AgentStatusPoller

agentInfo['disk_warning'] = listDiskUsageOverThreshold(self.config, updateDB=True)

that uses df -klP (thanks Hassan for spotting it!), parsing only the last two columns

df = subprocess.Popen(["df", "-klP"], stdout=subprocess.PIPE)

As Hassan described, df -klP from inside the docker container reports the path of the directory from the container's point of view.

Since we bind mount into the docker container multiple directories that on the VM host live under /data, when df runs inside the container it picks one of the "destination" mount points referring to the host filesystem/partition it is reporting on. For example, notice how df reports disk usage on the host [0] (filesystem /dev/vdb, mount point /data), while inside the container the mount point is not consistent across different wmagents [1] and [2].
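For reference, a minimal sketch of what that parsing boils down to (simplified; the real listDiskUsageOverThreshold in Utilities.py also deals with the configuration and database updates):

import subprocess

def list_disk_usage_over_threshold(threshold=85):
    """Simplified sketch of the df parsing; not the actual WMCore code."""
    df = subprocess.Popen(["df", "-klP"], stdout=subprocess.PIPE)
    output = df.communicate()[0].decode("utf-8").splitlines()
    over = []
    for line in output[1:]:                    # skip the header line
        split = line.split()
        if len(split) < 6:
            continue
        percent = int(split[4].rstrip("%"))    # "Capacity" column, e.g. "85%"
        if percent >= threshold:
            # split[5] is "Mounted on": inside the container this is the
            # bind-mount destination (/etc/group, /data/certs, ...)
            over.append({'mounted': split[5], 'percent': split[4]})
    return over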

what can we do to solve it

So, this is an unexpected regression from deploying WMCore with docker containers.

The cleanest solution that comes to my mind is to always identify a partition by the "Filesystem" column and not by "Mounted on", changing in Utilities.py

- diskPercent.append({'mounted': split[5], 'percent': split[4]})
+ diskPercent.append({'mounted': split[0], 'percent': split[4]})

so that we consistently report the /data partition as /dev/vdb.
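Applied to the sketch above, the change amounts to keying each entry on the Filesystem column; a deduplication step is also sketched in case the same device shows up on more than one line (names are illustrative, not the actual Utilities.py code):

def list_disk_usage_by_filesystem(threshold=85):
    """Illustrative sketch of the proposed change; not the actual WMCore code."""
    df = subprocess.Popen(["df", "-klP"], stdout=subprocess.PIPE)
    output = df.communicate()[0].decode("utf-8").splitlines()
    seen, over = set(), []
    for line in output[1:]:
        split = line.split()
        if len(split) < 6 or split[0] in seen:
            continue
        seen.add(split[0])                     # one entry per device, e.g. /dev/vdb
        percent = int(split[4].rstrip("%"))
        if percent >= threshold:
            over.append({'mounted': split[0], 'percent': split[4]})
    return over

With this change, vocms0281 would report /dev/vdb:85% both on the host and inside the container, instead of /etc/group:85%.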

Otherwise, the internet suggests bind mounting the root directory of the host read-only [3], but I really dislike this approach.

In any case, I would like to hear a quick opinion from @todor-ivanov, whom I trust with all these docker shenanigans :)


anatomy of df -klP

[0]

(on the host)

cmst1@vocms0281:dmapelli $ df -klP
Filesystem     1024-blocks      Used Available Capacity Mounted on
/dev/vda1        334914540  14111860 320802680       5% /
/dev/vda15          556948      7228    549720       2% /boot/efi
tmpfs              8388608   1069908   7318700      13% /mnt/ramdisk
/dev/vdb        1031911884 827451044 154616384      85% /data

[1]

(inside the container, vocms0281)

(WMAgent-2.3.4.4) [cmst1@vocms0281:current]$ df -klP
Filesystem     1024-blocks      Used Available Capacity Mounted on
overlay          334914540  14111860 320802680       5% /
/dev/vda1        334914540  14111860 320802680       5% /tmp
/dev/vdb        1031911884 827448800 154618628      85% /etc/group

[2]

(inside the container, vocms0256)

(WMAgent-2.3.4.4) [cmst1@vocms0256:current]$ df -klP
Filesystem     1024-blocks      Used Available Capacity Mounted on
overlay          334914540  11805236 323109304       4% /
/dev/vda1        334914540  11805236 323109304       4% /tmp
/dev/vdb        1031911884 772601688 209465740      79% /data/certs

[3] netdata/netdata#16444 (comment)

@todor-ivanov
Contributor

todor-ivanov commented Oct 9, 2024

Hi @mapellidario, your explanation is complete and good.

About:

Otherwise, the internet suggests bind mounting the root directory of the host read-only [3], but I really dislike this approach.

I definitely would not want to add yet another item to my list of things to regret before I meet the Reaper.

About this though:

The cleanest solution that comes to my mind is to always identify a partition by the "Filesystem" column and not by "Mounted on", changing in Utilities.py

Your suggestion is one way to go, but in that case people will only ever see the volume reported as reaching the threshold, which is indeed true and fair. Another way to go would be simply to visualize all mount points, including the duplicate mounts:

in

df = subprocess.Popen(["df", "-klP"], stdout=subprocess.PIPE)

the following change should do the job:

- df = subprocess.Popen(["df", "-klP"], stdout=subprocess.PIPE) 
+ df = subprocess.Popen(["df", "-klPa"], stdout=subprocess.PIPE)

The choice should be up to whoever takes the issue on. Both will do.

@amaltaro
Contributor Author

amaltaro commented Oct 9, 2024

@mapellidario can you please take this issue on? No pressure intended, but if we can have this development by the end of this week, we can already consider it for the upcoming WMAgent release.
