WALA/waagent/wa-linux-agent issues after OEM switch #1352

Open
3 tasks
jepio opened this issue Feb 8, 2024 · 2 comments
Labels
kind/bug Something isn't working platform/Azure

Comments

@jepio
Member

jepio commented Feb 8, 2024

Description

Several issues have been reported by image-builder (kubernetes-sigs/image-builder#1395) and the WALA agent team; some of them overlap. I will summarize them here:

  • bug in our downstream patch
sudo bash -c '/usr/sbin/waagent -force -deprovision+user && ln -sf ../run/systemd/resolve/resolv.conf /etc/resolv.conf && sync'
    WARNING! The waagent service will be stopped.
    WARNING! Cached DHCP leases will be deleted.
    WARNING! /etc/resolv.conf will be deleted.
    WARNING! packer account and entire home directory will be deleted.
    WARNING! /etc/machine-id will be removed.
Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/agent.py", line 263, in main
    agent.deprovision(force, deluser=True)
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/agent.py", line 155, in deprovision
    deprovision_handler.run(force=force, deluser=deluser)
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/pa/deprovision/default.py", line 221, in run
    self.do_actions(actions)
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/pa/deprovision/default.py", line 241, in do_actions
    action.invoke()
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/pa/deprovision/default.py", line 57, in invoke
    self.func(*self.args, **self.kwargs)
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/common/osutil/default.py", line 1342, in del_account
    if self.is_sys_user(username):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/common/osutil/coreoscommon.py", line 29, in is_sys_user
    return super(CoreOSUtil, self).is_sys_user(username)
                 ^^^^^^^^^^
NameError: name 'CoreOSUtil' is not defined

This needs to be super(CoreosCommonUtil, self).is_sys_user(username), or better yet, our patch should be upstreamed.
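A minimal sketch of the corrected call, assuming the class in coreoscommon.py is named CoreosCommonUtil as stated above. The base class and its rule here are hypothetical stand-ins, not the agent's real code. Note that the two-argument form of super() needs both the class and the instance; the one-argument form returns an unbound object and would not resolve is_sys_user.

```python
class DefaultOSUtil:
    """Hypothetical, simplified stand-in for the agent's base OS utility class."""

    def is_sys_user(self, username):
        # Placeholder rule for illustration only; the real check differs.
        return username in {"root", "daemon"}


class CoreosCommonUtil(DefaultOSUtil):
    def is_sys_user(self, username):
        # Name the class that is actually being defined, and pass the instance.
        # The traceback above came from naming a class (CoreOSUtil) that does
        # not exist in this module.
        return super(CoreosCommonUtil, self).is_sys_user(username)
```

On Python 3 alone, a bare super() would also work; the explicit form is kept here on the assumption that the code must stay compatible with older interpreters.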

  • python 3.11 is not tested by upstream
==>
During handling of the above exception, another exception occurred:
==> azure-arm.sig-flatcar:
Traceback (most recent call last):
  File "/usr/sbin/waagent", line 39, in <module>
    agent.main()
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/agent.py", line 283, in main
    textutil.format_exception(e))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/azurelinuxagent/common/utils/textutil.py", line 448, in format_exception
    msg += ''.join(traceback.format_exception(etype=type(exception), value=exception, tb=tb))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: format_exception() got an unexpected keyword argument 'etype'

Tricky. I don't think it would be easy to switch back to python3.10 for Azure OEM?
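One possible fix on the agent side, sketched under assumptions: passing the arguments to traceback.format_exception positionally avoids the etype keyword entirely and works on both older and newer Python 3 releases. As far as I can tell from the Python documentation, the etype parameter was renamed and made positional-only in Python 3.10, so going back to 3.10 would likely not avoid this error. The helper below is an illustration, not the agent's actual textutil.format_exception.

```python
import sys
import traceback


def format_exception(exception):
    """Return the formatted traceback of `exception` as a single string."""
    # Python 3 exceptions carry their traceback with them.
    tb = getattr(exception, "__traceback__", None)
    if tb is None:
        # Fallback: traceback of the exception currently being handled, if any.
        tb = sys.exc_info()[2]
    # Positional arguments only: no `etype=` keyword, which newer Pythons reject.
    return "".join(traceback.format_exception(type(exception), exception, tb))
```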

  • waagent upstream still expects to find waagent.conf in /usr/share/oem, not /etc.
    from @narrieta
Our automation reported this issue today. It seems like in recent Flatcar images, waagent.conf is located under /etc. Current versions of the Agent are coded to look for it under /usr/share/oem

     https://github.com/Azure/WALinuxAgent/blob/master/azurelinuxagent/common/osutil/factory.py#L90
     https://github.com/Azure/WALinuxAgent/blob/master/azurelinuxagent/common/osutil/coreos.py#L28

By default, AutoUpdate.Enabled is disabled in Flatcar, but if one enables it, the Agent ends up completely broken.

When the Agent installed on Flatcar (version 2.6) performs the update, the new version crashes because it cannot find waagent.conf. Version 2.6 then also crashes: while handling the error from the update, it uses an API that is no longer available in Python 3.11.

Is the location for waagent.conf going to be /etc permanently? I can update the location in our code.

As far as the Python version, the Agent is not fully tested on Python 3.10+, so there may be other issues lying around. I'll try to run the code thru some tools to see if I find other issues.

We should try this upgrade path on Flatcar 3760 and on releases older than 3760. Can we add a compatibility symlink from /usr/share/oem/waagent.conf -> /etc/waagent.conf for newer Flatcar? Upstreaming our patch will also help.
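As an alternative or complement to the symlink, the agent could probe both locations. The sketch below is hypothetical: the function and constant names are made up for illustration, and only the two paths come from the discussion above. It prefers the newer /etc location and falls back to /usr/share/oem.

```python
import os

# Candidate locations, newest layout first. Both paths are from this issue.
CANDIDATE_CONF_PATHS = ("/etc/waagent.conf", "/usr/share/oem/waagent.conf")


def find_waagent_conf(candidates=CANDIDATE_CONF_PATHS):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

A lookup like this would let the same agent version run on both older and newer Flatcar images without depending on the symlink being present.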

Also hits the same issue as above.


@jepio
Member Author

jepio commented Feb 8, 2024

@krnowak are you able to deal with this?

@krnowak
Member

krnowak commented Feb 8, 2024

I'll have a look.

Projects
Status: 📝 Needs Triage