Receptor Work unit expired #890
Ensure the time on AWX and the Receptor node is in sync.
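For anyone landing here, a quick way to compare the clocks is sketched below. It assumes the execution node uses chrony (the RHEL default) and that the AWX task container is reachable with kubectl; the namespace, deployment, and container names are placeholders, so adjust them to your install.

```bash
# On the execution node (RHEL): check the system clock and NTP sync state
timedatectl status
chronyc tracking          # assumes chrony is the NTP client

# On the Kubernetes side: print UTC time inside the AWX task pod
# (namespace/deployment/container names are typical defaults, not guaranteed)
kubectl -n awx exec deploy/awx-task -c awx-task -- date -u

# Compare with the execution node
date -u
```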
@Iyappanj
@kurokobo Yes, the time is in sync, but I still sometimes see this issue, and then it resolves by itself for a few nodes.
Ah, sorry, I misread the message as an error about token expiration.
Hi everyone. Like the OP, I'm encountering a similar issue on an execution node hosted on RHEL servers. It has been a while since we deployed AWX in production, and this is the first time we've experienced an issue with the execution nodes.

Problem description

I see many errors like these in my receptor logs:

```
ERROR 2024/01/11 09:46:21 Error locating unit: IbVMji5u
ERROR 2024/01/11 09:46:21 : unknown work unit IbVMji5u
```

These errors have been around for a while and never caused any issue, but we have more jobs running on AWX now. After a moment, I experience a timeout between the AWX task pod and my execution node:

```
Wed Jan 10 15:56:24 UTC 2024
0: awx-task-79bc97cd7-mrb5g in 288.899µs
1: XX.XX.XX.XX in 1.808779ms
Wed Jan 10 15:56:25 UTC 2024
0: awx-task-79bc97cd7-mrb5g in 230.033µs
ERROR: 1: Error timeout from  in 10.000210552s
```

I can't figure out what could be causing this timeout. When my execution node switches from 'ready' to 'unavailable', AWX reports:

```
Receptor error from XX.XX.XX.XX, detail:
Work unit expired on Thu Jan 10 16:03:12
```

At this moment, my only workaround is to restart the receptor service on the execution node. I've already checked a number of things.
@kurokobo, or anyone else, do you have an idea, please? I'm running out of ideas here...

Additional information

Execution node VM information:

```
NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
...
```

AWX information:

- Kubernetes version: v1.25.6
- AWX version: 23.6.0
- AWX Operator: 2.10.0
- PostgreSQL version: 15

Receptor information:

```
receptorctl 1.4.3
receptor v1.4.3
```

Ansible-runner version:

```
ansible-runner 2.3.4
```

Podman information:

```yaml
host:
arch: amd64
buildahVersion: 1.31.3
cgroupControllers: []
cgroupManager: cgroupfs
cgroupVersion: v1
conmon:
package: conmon-2.1.8-1.module+el8.9.0+20326+387084d0.x86_64
path: /usr/bin/conmon
version: 'conmon version 2.1.8, commit: 579be593361fffcf49b6c5ba4006f2075fd1f52d'
cpuUtilization:
idlePercent: 96.97
systemPercent: 0.4
userPercent: 2.63
cpus: 4
databaseBackend: boltdb
distribution:
distribution: '"rhel"'
version: "8.9"
eventLogger: file
freeLocks: 2046
hostname: XX.XX.XX.XX
idMappings:
gidmap:
- container_id: 0
host_id: 21000
size: 1
- container_id: 1
host_id: 1214112
size: 65536
uidmap:
- container_id: 0
host_id: 12007
size: 1
- container_id: 1
host_id: 1214112
size: 65536
kernel: 4.18.0-513.9.1.el8_9.x86_64
linkmode: dynamic
logDriver: k8s-file
memFree: 11834806272
memTotal: 16480423936
networkBackend: cni
networkBackendInfo:
backend: cni
dns:
package: podman-plugins-4.6.1-4.module+el8.9.0+20326+387084d0.x86_64
path: /usr/libexec/cni/dnsname
version: |-
CNI dnsname plugin
version: 1.3.1
commit: unknown
package: containernetworking-plugins-1.3.0-4.module+el8.9.0+20326+387084d0.x86_64
path: /usr/libexec/cni
ociRuntime:
name: crun
package: crun-1.8.7-1.module+el8.9.0+20326+387084d0.x86_64
path: /usr/bin/crun
version: |-
crun version 1.8.7
commit: 53a9996ce82d1ee818349bdcc64797a1fa0433c4
rundir: /tmp/podman-run-12007/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
os: linux
pasta:
executable: ""
package: ""
version: ""
remoteSocket:
path: /tmp/podman-run-12007/podman/podman.sock
security:
apparmorEnabled: false
capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
rootless: true
seccompEnabled: true
seccompProfilePath: /usr/share/containers/seccomp.json
selinuxEnabled: false
serviceIsRemote: false
slirp4netns:
executable: /usr/bin/slirp4netns
package: slirp4netns-1.2.1-1.module+el8.9.0+20326+387084d0.x86_64
version: |-
slirp4netns version 1.2.1
commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
libslirp: 4.4.0
SLIRP_CONFIG_VERSION_MAX: 3
libseccomp: 2.5.2
swapFree: 4294963200
swapTotal: 4294963200
uptime: 18h 13m 25.00s (Approximately 0.75 days)
plugins:
authorization: null
log:
- k8s-file
- none
- passthrough
- journald
network:
- bridge
- macvlan
- ipvlan
volume:
- local
registries:
search:
- docker.io
store:
configFile: /home/awx/.config/containers/storage.conf
containerStore:
number: 2
paused: 0
running: 1
stopped: 1
graphDriverName: overlay
graphOptions:
overlay.mount_program:
Executable: /usr/bin/fuse-overlayfs
Package: fuse-overlayfs-1.12-1.module+el8.9.0+20326+387084d0.x86_64
Version: |-
fusermount3 version: 3.3.0
fuse-overlayfs: version 1.12
FUSE library version 3.3.0
using FUSE kernel interface version 7.26
graphRoot: /home/awx/.local/share/containers/storage
graphRootAllocated: 110867910656
graphRootUsed: 10662989824
graphStatus:
Backing Filesystem: extfs
Native Overlay Diff: "false"
Supports d_type: "true"
Using metacopy: "false"
imageCopyTmpDir: /var/tmp
imageStore:
number: 13
runRoot: /tmp/podman-run-12007/containers
transientStore: false
volumePath: /home/awx/.local/share/containers/storage/volumes
version:
APIVersion: 4.6.1
Built: 1700309421
BuiltTime: Sat Nov 18 12:10:21 2023
GitCommit: ""
GoVersion: go1.19.13
Os: linux
OsArch: linux/amd64
  Version: 4.6.1
```
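As a side note for anyone debugging the same timeout: the mesh can be probed directly from the AWX task pod with receptorctl. A rough sketch follows; the control socket path and node name are assumptions, so take them from your own receptor.conf and instance list.

```bash
# Run inside the awx-task container (or anywhere the receptor control socket is available).
# Socket path and node name are examples; adjust to your deployment.
SOCK=/var/run/receptor/receptor.sock
NODE=my-execution-node

receptorctl --socket "$SOCK" status             # known nodes, routes, advertised work types
receptorctl --socket "$SOCK" ping "$NODE"       # round trip over the mesh
receptorctl --socket "$SOCK" traceroute "$NODE" # hop-by-hop path, like the output above

# Repeat the probe every few seconds to catch intermittent timeouts
while true; do date -u; receptorctl --socket "$SOCK" ping "$NODE"; sleep 5; done
```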
The issue is not related to:

```
ERROR 2024/01/11 09:46:21 Error locating unit: IbVMji5u
ERROR 2024/01/11 09:46:21 : unknown work unit IbVMji5u
```

I tried disabling the cleanup from AWX and doing it on my side, and I no longer see this error, but my execution node continues to time out randomly.
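For anyone who wants to try the same workaround, a minimal sketch of cleaning up finished work units by hand with receptorctl is below; the socket path is an assumption, and the unit ID is just the one from the log above.

```bash
# Socket path is an example; use the control-service socket from your receptor.conf.
SOCK=/var/run/receptor/receptor.sock

# List work units known to this node, with their state
receptorctl --socket "$SOCK" work list

# Release (delete) a specific finished unit, e.g. the one from the error above
receptorctl --socket "$SOCK" work release IbVMji5u
```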
Hi, I have the same issue, but in my case it started after a Red Hat update of the execution node from 8.8 to the latest 8.8 kernel. No solution yet? Can you advise, @koro? Thanks for your support.
Similar topic: #934. Could anyone here who is facing this issue share your …?
+1, have the same issue.
I ran into this on Friday. My last job was one with id …

The failed job ran on aap-1, and I see this in the messages at about that time. However, this is not the only instance of that error. Please find the logs attached.

Please accept my apologies for the fact that some log lines are duplicated in the AWX task logs; this is because I can only download them 500 messages at a time from Google's logging console. AWX is running inside a Google Kubernetes Engine cluster, while aap-0 and aap-1 are running on RHEL 9 VMs inside Google Compute Engine. Here is a screen clip of the topology screen for my cluster, per the comment from @kurokobo. Note that this is after restarting the awx-task deployment, so the AWX task node has changed its id.

The podman version on aap-1 is 3.4.4. I wonder whether, if I upgraded to a version where containers/conmon#440 has been fixed, I would still see this?
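If it helps anyone checking the same theory, the podman and conmon builds on an execution node can be inspected as below; this is only a generic version check, not a confirmation that conmon is the cause.

```bash
# Versions of the container stack on the execution node (RHEL)
podman version
rpm -q podman conmon crun

# Full environment dump, same as the report earlier in this thread
podman info
```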
I've had similar issues, and downgrading receptor to version 1.4.2 seems to solve it somehow.
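In case someone wants to try the same downgrade, a sketch is below. It assumes receptor was installed as an RPM on the execution node and runs as the `receptor` systemd service; package and service names may differ if you installed it another way.

```bash
# Check the currently installed version
receptor --version
rpm -q receptor

# Downgrade to 1.4.2 if that build is still available in your configured repos
sudo dnf downgrade receptor-1.4.2

# Restart the service so the new binary is picked up, then verify the mesh
sudo systemctl restart receptor
receptorctl --socket /var/run/receptor/receptor.sock status
```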
Recently, one of our receptor nodes has been showing as unavailable in AWX, and we see the error below:

```
Receptor error from XX.XX.XX.XX, detail:
Work unit expired on Mon Oct 30 12:04:34
```

Restarting the receptor service did not fix the issue. Any idea what is causing this?
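For completeness, the checks usually worth running when a node flips to unavailable are sketched below; the systemd unit name and socket path are common defaults and may differ on your install.

```bash
# On the execution node: restart and inspect the receptor service
sudo systemctl restart receptor
sudo systemctl status receptor
sudo journalctl -u receptor --since "1 hour ago"

# From the control node / awx-task pod: confirm the node is back in the mesh
receptorctl --socket /var/run/receptor/receptor.sock status
```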