FileLog Pipeline Not Closed, Causing runc exec to Hang #4141
Comments
To add to your analysis
runc/libcontainer/setns_init_linux.go Lines 132 to 136 in 99f7fa1
which happens right before exec'ing the binary supplied to So, what you see is probably caused by the fact that this code was not reached. Now, I think that you assume that PID685940 ( If you can repro this, can you look into:
@ruijzhan ideally, if you can find a way to repro this with bare runc, we can take a deeper look.
Description
1. Problem Phenomenon and Impact
On the node's host system, there are long-running runc exec processes that never exit. Their parent process is containerd-shim and their child process is etcdctl. The child is already defunct, that is, a zombie process.
Using commands such as `docker ps -a`, we identified the stuck container ID and matched it to the Pod in the cluster. The issue occurs when kubelet executes a binary in the container namespace, through OCI calling runc exec, during exec-type liveness or readiness probes. runc does not exit and does not reap the child process.
This problem has been present from runc 1.0.2, released two years ago, through the latest version, 1.1.10. It occasionally causes the cluster's Docker to trigger PLEG, leading to Node NotReady status.
2. Debugging Information
Debugging of the runc process was conducted using dlv attach.
Two non-system goroutines were identified. One was blocked on channel reading, and the other was blocked on file reading.
Detailed information about the file descriptor was obtained, indicating that the file was not closed.
3. Code Analysis
The channel read is blocked waiting on `logsDone`, which blocks the entire runc exec and prevents it from exiting.
The function `parent.forwardChildLogs()` is implemented to return `logs.ForwardLogs(p.logFilePair.parent)`.
`p.logFilePair.parent` is declared in the implementation of `newParentProcess()`.
The implementation of `ForwardLogs()` shows a goroutine blocked in the loop at `for s.Scan()`. The statement `done <- s.Err()` is never reached, so the initial channel read stays blocked.
The purpose of this code is to set the namespace for the process that runc exec is about to execute. The logs forwarded by logPipe are also the output of the setns command, but for some unknown reason the pipe was not closed after the setns execution completed. There is also no explicit close of `p.logFilePair.parent` anywhere in the code.
4. Temporary Solution
Since `p.logFilePair.parent` was not closed, leaving runc blocked, we close it after one minute of blocking. One minute is significantly longer than the probe's 5-second timeoutSeconds, so this will not affect the probe results.
Steps to reproduce the issue
Describe the results you received and expected
I expect the logFile pipe to always be closed so that it does not block runc exec.
What version of runc are you using?
from 1.0.2 to 1.1.10
Host OS information
No response
Host kernel information
Linux yCCS-vmMXiz6CnH 5.4.224-1.el7.elrepo.x86_64 #1 SMP Tue Nov 8 17:24:56 EST 2022 x86_64 x86_64 x86_64 GNU/Linux