
New connections to draining Runners #659

Open
mpass99 opened this issue Aug 18, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@mpass99
Contributor

mpass99 commented Aug 18, 2024

Our current drain_on_shutdown strategy for stopping Nomad agents is:

  • On shutdown, the Nomad agent becomes ineligible, and no new runners are scheduled on it.
  • Within the drain-on-shutdown deadline, all running executions have time to finish.
    • ⚡ We still start new executions in runners on the draining agent, which may not have enough time to finish.
  • After that, the Nomad agent shuts down.

The executions that don't have enough time to finish result in a user-visible error.

We might need to "exclude" some runners from new executions as soon as the respective Nomad agent is about to shut down.

See #651


Unfortunately, we currently don't have any metric to count how often this issue occurs.

@mpass99 mpass99 added the bug Something isn't working label Aug 18, 2024
@MrSerth
Member

MrSerth commented Aug 21, 2024

To do: Let's identify which error / log information / ... we get when the above issue occurs.

@mpass99
Contributor Author

mpass99 commented Sep 12, 2024

We've conducted a local reproduction of this scenario: nomadEventLog-ExecuteDraining.txt.
It shows that POSEIDON-3W (#590) with the sub-error "the allocation was rescheduled" indicates this error. This error has not occurred for at least 90 days.

If we consider a fix for this necessary in the future, we might consider listening to Nomad's Node events to receive drain updates, fetch all allocations of this node, and block new executions for these allocations/runners. Further, we should ensure that the drain deadline matches the maximum of all allowed execution timeouts (of CodeOcean).
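The proposed flow could be sketched as follows: on a Node drain event, look up the node's allocations and block their runners for new executions. The types below (`NodeEvent`, `Allocation`, `Manager`) are simplified stand-ins, not Nomad's or Poseidon's real API.

```go
package main

import "fmt"

// NodeEvent is a simplified stand-in for a Nomad Node event carrying
// drain updates.
type NodeEvent struct {
	NodeID   string
	Draining bool
}

// Allocation is a simplified stand-in for a Nomad allocation.
type Allocation struct {
	ID     string
	NodeID string
}

// Manager tracks which allocations are excluded from new executions.
type Manager struct {
	allocsByNode map[string][]Allocation // assumed snapshot of allocations per node
	blocked      map[string]bool         // allocation IDs excluded from new executions
}

func NewManager(allocsByNode map[string][]Allocation) *Manager {
	return &Manager{allocsByNode: allocsByNode, blocked: map[string]bool{}}
}

// HandleNodeEvent blocks all allocations of a node once it starts draining.
func (m *Manager) HandleNodeEvent(ev NodeEvent) {
	if !ev.Draining {
		return
	}
	for _, alloc := range m.allocsByNode[ev.NodeID] {
		m.blocked[alloc.ID] = true
	}
}

// CanExecute reports whether a new execution may start in this allocation.
func (m *Manager) CanExecute(allocID string) bool { return !m.blocked[allocID] }

func main() {
	m := NewManager(map[string][]Allocation{
		"node-a": {{ID: "alloc-1", NodeID: "node-a"}},
	})
	m.HandleNodeEvent(NodeEvent{NodeID: "node-a", Draining: true})
	fmt.Println(m.CanExecute("alloc-1")) // false: blocked after the drain event
}
```

In a real implementation, the event source would be Nomad's event stream and the allocation lookup a Nomad API call; additionally, the drain deadline would be validated against the maximum allowed CodeOcean execution timeout.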

@MrSerth
Member

MrSerth commented Sep 25, 2024

This issue is still valid, and fixing it could be a nice improvement. However, we don't expect this problem to occur often, so it doesn't have a high priority.
