
loading seccomp filter: invalid argument #2865

Closed
areed opened this issue Mar 19, 2021 · 11 comments

@areed

areed commented Mar 19, 2021

We're seeing machines with several runc init processes blocked, all writing the same message to stderr:

$ ps -ef | grep '[r]unc init' | awk '{ print $2 }' | xargs -I'{}' sudo strace -p '{}' -s256 -e write
strace: Process 32329 attached
write(2, "standard_init_linux.go:207: init seccomp caused: error loading seccomp filter into kernel: loading seccomp filter: invalid argument\n", 132) = 132
+++ exited with 1 +++
strace: Process 32453 attached
write(2, "standard_init_linux.go:207: init seccomp caused: error loading seccomp filter into kernel: loading seccomp filter: invalid argument\n", 132) = 132
+++ exited with 1 +++
strace: Process 32559 attached
write(2, "standard_init_linux.go:207: init seccomp caused: error loading seccomp filter into kernel: loading seccomp filter: invalid argument\n", 132) = 132
+++ exited with 1 +++

This appears to cause a chain reaction on Kubernetes nodes: a lock acquired during docker start for a Pod's pause container blocks PLEG, and the node flaps between Ready and NotReady.

@cpuguy83
Contributor

Seeing this too with containerd. I can repro easily in an AKS cluster with rc93. rc92 works just fine.

Calls look like this:

[pid 60015] seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_LOG, NULL) = -1 EFAULT (Bad address)
[pid 60018] <... nanosleep resumed> NULL) = 0
[pid 60015] seccomp(SECCOMP_GET_ACTION_AVAIL, 0, [SECCOMP_RET_LOG] <unfinished ...>
[pid 60018] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 60015] <... seccomp resumed> )     = 0
[pid 60015] prctl(PR_SET_SECCOMP, ...
[pid 60018] <... nanosleep resumed> NULL) = 0
[pid 60018] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 60015] <... prctl resumed> )       = -1 EINVAL (Invalid argument)
[pid 60018] <... nanosleep resumed> NULL) = 0
[pid 60018] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 60015] write(2, "standard_init_linux.go:207: init seccomp caused: error loading seccomp filter into kernel: loading seccomp filter: invalid argum"..., 132) = 132

I truncated the PR_SET_SECCOMP call because it is very long.
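
For reference, here is a minimal C sketch (my own illustration of the kernel interface, not runc's or libseccomp's actual code) of the two load paths visible in that trace: the probe of the newer seccomp(2) syscall with a NULL filter (which is expected to fail with EFAULT when the flag is supported), and the older prctl(PR_SET_SECCOMP) fallback, which is the call coming back EINVAL above:

/*
 * Sketch only: load a trivial seccomp filter via seccomp(2), falling back
 * to prctl(PR_SET_SECCOMP), mirroring the calls seen in the strace output.
 */
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
    /* Trivial allow-everything program, just so there is a valid filter. */
    struct sock_filter insns[] = {
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };

    /* Required to load a filter without CAP_SYS_ADMIN. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        perror("prctl(PR_SET_NO_NEW_PRIVS)");

    /*
     * Probe as in the trace: seccomp(2) with a NULL filter fails with
     * EFAULT when the syscall and the flag are supported, and with
     * EINVAL/ENOSYS when they are not.
     */
    if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                SECCOMP_FILTER_FLAG_LOG, NULL) != 0)
        fprintf(stderr, "probe: errno=%d\n", errno);

    /* Newer path: load the filter with seccomp(2). */
    if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog) != 0) {
        perror("seccomp(SECCOMP_SET_MODE_FILTER)");

        /* Older fallback: prctl(PR_SET_SECCOMP), the call that returns
         * EINVAL in the trace above. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            perror("prctl(PR_SET_SECCOMP)");
    }
    return 0;
}

Running something like this on an affected node versus a healthy one might help narrow down whether the failure is in the seccomp interface itself or in the specific filter runc generates.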

@cpuguy83
Contributor

Worth noting, runc init is blocked for me until I attempt to strace it; then it exits with that error.
I've also tried modifying it to dump a stack trace to disk on SIGUSR1, and that also makes it exit, with no stack dump written.

@cpuguy83
Contributor

Bisected to 7a8d716

@cpuguy83
Contributor

/cc @cyphar

@cyphar
Member

cyphar commented Mar 30, 2021

I believe #2871 fixes this.

cpuguy83 added a commit to cpuguy83/containerd that referenced this issue Mar 30, 2021
There is a bug with rc93 that may be causing CI instability
(opencontainers/runc#2865).
Downgrading to rc92 to see if we get better runs.

Signed-off-by: Brian Goff <[email protected]>
@wu0407

wu0407 commented Mar 31, 2021

Maybe the same issue: containerd/containerd#5280

@oppianmatt

oppianmatt commented Apr 1, 2021

We get this error pretty consistently on 2 machines running Ubuntu 16.04.

Downgrading containerd.io to 1.4.3 doesn't fix it, as that still uses runc rc93 and not rc92.

[screenshot]

It was plaguing us for weeks; sometimes the system locked up and a reboot was the only fix, and deploys were getting stuck.

We found in the end, after reading this thread, that the way to unstick them is to run:

ps aux | grep "runc init" | grep -v grep | awk "{print \$2}" | xargs -r -n 1 -t strace -p

That straces each one until it exits, and then the system works again.

@kolyshkin
Contributor

This might be fixed by #2871, which was just merged.

@oppianmatt @wu0407 @areed @cpuguy83 can you please test the runc tip and report back whether the bug is fixed?

@oppianmatt

We were only experiencing this on prod, until today that is. I tried removing an image from our staging server and it got stuck, so I'm using it as an opportunity to test.

I downloaded runc rc92, replaced the binary on the system, and restarted the containerd and docker services.

When restarting docker in this state we see huge stack dumps in the log, all about mutex locks.

You can see some were stuck for quite a while:

[screenshots of the stack dumps]

So much is being dumped into syslog that the logs are 5 minutes behind.

After that, though, the stuck containers can be removed.

Of course I won't know for sure until I get another stuck instance; if I do I will report right back, so take no news as good news.

@cpuguy83
Contributor

cpuguy83 commented Apr 2, 2021

@kolyshkin Verified both by cherry-picking that commit onto rc93 and testing HEAD directly.

@cyphar
Member

cyphar commented Apr 3, 2021

Okay, so it looks like it's fixed then. Please comment if this still isn't fixed after testing on tip. We will do a new release soon.

@cyphar cyphar closed this as completed Apr 3, 2021
haslersn added a commit to stuvusIT/ansible_containerd that referenced this issue Apr 16, 2021
haslersn added a commit to stuvusIT/ansible_containerd that referenced this issue Apr 16, 2021
fabianhick pushed a commit to stuvusIT/ansible_containerd that referenced this issue Apr 16, 2021
Signed-off-by: Sebastian Hasler <[email protected]>

Co-authored-by: Sebastian Hasler <[email protected]>