Receptor Work unit expired #890

Open
Iyappanj opened this issue Oct 30, 2023 · 13 comments

Comments

@Iyappanj

Recently one of our receptor nodes has been showing as unavailable in AWX, and we see the error below:

Receptor error from XX.XX.XX.XX, detail:
Work unit expired on Mon Oct 30 12:04:34

Restarting the receptor service did not fix the issue. Any idea what is causing this?

@kurokobo
Contributor

Ensure the time on the AWX and Receptor nodes is in sync.

@djyasin
Member

djyasin commented Nov 22, 2023

@Iyappanj Were you able to resolve this with @kurokobo's feedback? Are you still encountering difficulties?

@Iyappanj
Author

@djyasin @kurokobo I still see the issue even when the time zone is the same on both.

@kurokobo
Contributor

@Iyappanj
It is not a time zone issue; make sure both the AWX and Receptor nodes are synchronized with an NTP server that provides accurate time.
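
On a RHEL-based node, a quick way to confirm this is something like the following (assuming chrony is the NTP client; adjust for your setup):

# Check whether the system clock is reported as synchronized
timedatectl

# With chrony, show the measured offset from the NTP sources
chronyc tracking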

@Iyappanj
Author

Iyappanj commented Dec 7, 2023

@kurokobo Yes, the time is in sync, but I still see this issue sometimes, and then it resolves itself for a few nodes.

@kurokobo
Contributor

kurokobo commented Dec 7, 2023

Ah, sorry, I misread the message as an error about token expiration.
Are there any logs in the awx-ee container and the awx-task container of the awx-task pod?
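
Something like this should pull them (the namespace and pod name are assumptions; substitute your own):

# Find the awx-task pod
kubectl -n awx get pods

# Logs from the awx-ee container (receptor) and the awx-task container
kubectl -n awx logs <awx-task-pod-name> -c awx-ee
kubectl -n awx logs <awx-task-pod-name> -c awx-task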

@LalosBastien

Hi everyone,

I'm encountering an issue similar to the OP's on an execution node hosted on RHEL servers. It has been a while since we deployed AWX in production, and this is the first time we've experienced an issue with the execution nodes.

Problem description

I have many errors in my receptor.log similar to this one:

ERROR 2024/01/11 09:46:21 Error locating unit: IbVMji5u
ERROR 2024/01/11 09:46:21 : unknown work unit IbVMji5u

These errors have been appearing for a while and never caused any problems, but we have more jobs running on AWX now.

After a while, I experience a timeout between the awx-task pod (awx-ee container) and my execution node.
Here is the result of the receptorctl traceroute command from the awx-ee container:

Wed Jan 10 15:56:24 UTC 2024
0: awx-task-79bc97cd7-mrb5g in 288.899µs
1: XX.XX.XX.XX in 1.808779ms

Wed Jan 10 15:56:25 UTC 2024
0: awx-task-79bc97cd7-mrb5g in 230.033µs
ERROR: 1: Error timeout from  in 10.000210552s

I can't figure out what could be causing this timeout.
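
For reference, the traceroute above was run from inside the awx-ee container roughly like this (the namespace is an assumption; the pod name is the one from my cluster and the socket path is the default):

# Trace the receptor route from the control node to the execution node
kubectl -n awx exec -it awx-task-79bc97cd7-mrb5g -c awx-ee -- \
  receptorctl --socket /var/run/receptor/receptor.sock traceroute <execution-node-name>

# Overall mesh view from the same container
kubectl -n awx exec -it awx-task-79bc97cd7-mrb5g -c awx-ee -- \
  receptorctl --socket /var/run/receptor/receptor.sock status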

When my execution node switches from the 'ready' to the 'unavailable' state, awx-task can no longer peer with it, and I encounter this error when attempting to run a health check:

Receptor error from XX.XX.XX.XX, detail:
Work unit expired on Thu Jan 10 16:03:12

At the moment, my only workaround is to restart the awx-task pod whenever one of my execution nodes is lost.
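
Concretely, that workaround is just a rollout restart of the task deployment, something like this (the namespace and deployment name are assumptions based on my environment):

# Restart the awx-task pods; the lost execution node usually comes back to 'ready' afterwards
kubectl -n awx rollout restart deployment awx-task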

I've already checked some things (roughly along the lines of the commands sketched after this list):

  • NTP synchronized
  • Firewall (rate-limiting, bruteforce, etc)
  • MaxLogFile
  • File descriptor
  • Reinstallation
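
The kind of checks I mean, roughly (exact commands depend on your environment):

# Firewall configuration on the execution node (look for rate limiting or dropped receptor traffic)
sudo firewall-cmd --list-all

# File descriptor limit for the user running receptor
ulimit -n

# Receptor service health on the execution node
systemctl status receptor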

@kurokobo or someone else, do you have any ideas, please? I'm running out of ideas here...

Additional information

Execution node VM information :

NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
...

AWX information:

Kubernetes version: v1.25.6
AWX version: 23.6.0
AWX Operator: 2.10.0
PostgreSQL version: 15

Receptor information

receptorctl  1.4.3
receptor     v1.4.3

Ansible-runner version

ansible-runner 2.3.4

Podman information:

host:
  arch: amd64
  buildahVersion: 1.31.3
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: conmon-2.1.8-1.module+el8.9.0+20326+387084d0.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.8, commit: 579be593361fffcf49b6c5ba4006f2075fd1f52d'
  cpuUtilization:
    idlePercent: 96.97
    systemPercent: 0.4
    userPercent: 2.63
  cpus: 4
  databaseBackend: boltdb
  distribution:
    distribution: '"rhel"'
    version: "8.9"
  eventLogger: file
  freeLocks: 2046
  hostname: XX.XX.XX.XX
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 21000
      size: 1
    - container_id: 1
      host_id: 1214112
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 12007
      size: 1
    - container_id: 1
      host_id: 1214112
      size: 65536
  kernel: 4.18.0-513.9.1.el8_9.x86_64
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 11834806272
  memTotal: 16480423936
  networkBackend: cni
  networkBackendInfo:
    backend: cni
    dns:
      package: podman-plugins-4.6.1-4.module+el8.9.0+20326+387084d0.x86_64
      path: /usr/libexec/cni/dnsname
      version: |-
        CNI dnsname plugin
        version: 1.3.1
        commit: unknown
    package: containernetworking-plugins-1.3.0-4.module+el8.9.0+20326+387084d0.x86_64
    path: /usr/libexec/cni
  ociRuntime:
    name: crun
    package: crun-1.8.7-1.module+el8.9.0+20326+387084d0.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.7
      commit: 53a9996ce82d1ee818349bdcc64797a1fa0433c4
      rundir: /tmp/podman-run-12007/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    path: /tmp/podman-run-12007/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.1-1.module+el8.9.0+20326+387084d0.x86_64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 4294963200
  swapTotal: 4294963200
  uptime: 18h 13m 25.00s (Approximately 0.75 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /home/awx/.config/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 1
    stopped: 1
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.12-1.module+el8.9.0+20326+387084d0.x86_64
      Version: |-
        fusermount3 version: 3.3.0
        fuse-overlayfs: version 1.12
        FUSE library version 3.3.0
        using FUSE kernel interface version 7.26
  graphRoot: /home/awx/.local/share/containers/storage
  graphRootAllocated: 110867910656
  graphRootUsed: 10662989824
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 13
  runRoot: /tmp/podman-run-12007/containers
  transientStore: false
  volumePath: /home/awx/.local/share/containers/storage/volumes
version:
  APIVersion: 4.6.1
  Built: 1700309421
  BuiltTime: Sat Nov 18 12:10:21 2023
  GitCommit: ""
  GoVersion: go1.19.13
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.1

@LalosBastien

The issue is not related to:

ERROR 2024/01/11 09:46:21 Error locating unit: IbVMji5u
ERROR 2024/01/11 09:46:21 : unknown work unit IbVMji5u

I tried disabling the cleanup from AWX and doing it on my side, and I no longer see this error, but my execution nodes continue to time out randomly.

@birb57

birb57 commented Feb 12, 2024

Hi

I have the same issue, but it started after a Red Hat update of the execution node from 8.8 to the latest 8.8 kernel.

Is there no solution?

Can you advise, @koro?

Thanks for your support.

@kurokobo
Contributor

Similar topic: #934

Could anyone here who is facing this issue share your Administration > Topology View screen and receptor logs from both the control nodes and the execution nodes?
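
For example, something along these lines should collect them (the namespace, service name, and log locations are assumptions; adjust for your install):

# On each execution node, if receptor runs as a systemd service:
sudo journalctl -u receptor --since "1 hour ago" > receptor-execution.log
# or grab the log file if receptor is configured to log to one:
sudo cp /var/log/receptor/receptor.log receptor-execution.log

# On the control plane, receptor runs in the awx-ee container of the task pod:
kubectl -n awx logs deployment/awx-task -c awx-ee > receptor-control.log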

@dchittibala

+1, have the same issue.

@heretic098

heretic098 commented Mar 25, 2024

I ran into this on Friday. My last job was the one with id 11464, which shows as "error" in the graphical user interface and has the message No output found for this job. displayed. All subsequent jobs have failed since then; according to the GUI, it finished at 22/03/2024, 14:13:10. The thing that stands out for me is the insertId losg0lm7agp9jyem:

WARNING 2024/03/22 14:11:13 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection

The failed job ran on aap-1 and I see this in the messages at about that time:

Mar 22 14:10:47 aap-1 conmon[3350942]: conmon 46f779b515a9eef4b1d5 <nwarn>: stdio_input read failed Input/output error
Mar 22 14:12:52 aap-1 conmon[3351784]: conmon 90d76f3492c4764bfef0 <nwarn>: stdio_input read failed Input/output error

However, this is not the only instance of that error.

Please find attached the logs:

  • json formatted logging from the awx pod awx-task-b6ff7d555-b7lt5
  • receptor log from aap-0
  • /var/log/messages from aap-0
  • receptor log from aap-1
  • /var/log/messages from aap-1

Please accept my apologies for the fact that some log lines are duplicated in the AWX task logs; this is because I can only download them 500 messages at a time from Google's logging console. AWX is running inside a Google Kubernetes Engine cluster, while aap-0 and aap-1 are running on RHEL 9 VMs in Google Compute Engine.

logs.tar.gz

Here is a screen clip of the Topology View screen for my cluster, per the comment from @kurokobo:

[Topology View screenshot]

^^^ Note that this is after restarting the awx-task deployment, so the awx-task node has changed its id.

The podman version on aap-1 is 3.4.4. I wonder whether, if I upgraded to a version where containers/conmon#440 has been fixed, I would stop seeing this.
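
If it helps anyone else, checking and updating the relevant packages on the execution node would look roughly like this (package names are assumptions; verify against your repos):

# Check the currently installed versions
rpm -q podman conmon crun

# Update podman and its runtime components from the configured repos
sudo dnf update podman conmon crun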

@rexberg

rexberg commented Nov 24, 2024

I've had similar issues, and downgrading receptor to version 1.4.2 seems to solve it somehow.
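
For anyone who wants to try the same, on an execution node where receptor was installed from a package repository the downgrade would look roughly like this (package availability and the service name are assumptions; if you installed from a release tarball, replace the binary instead):

# Check the installed version
receptor --version

# Downgrade the package (only works if 1.4.2 is still available in your configured repos)
sudo dnf downgrade receptor-1.4.2

# Restart the service so the node reconnects to the mesh
sudo systemctl restart receptor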
