-
Notifications
You must be signed in to change notification settings - Fork 616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
orchestrator: Avoid restarting a task that has already been restarted #1305
Conversation
When I was debugging integration test failures a week ago, I dug deep into the recent restart supervisor changes, and noticed a possible race which is almost impossible in practice. This is the sequence of events that has to happen: 1) Restart task A -> new task B. 2) B fails before the restart delay has elapsed (pull failure): Restart(B) starts waiter goroutine. 3) B's node fails at exact moment after the restart delay finishes. The orchestrator calls Restart(B) again, this time there's no delay in progress so Restart proceeds to set B.DesiredState = SHUTDOWN and create a new task. 4) The waiter goroutine from step 2 is scheduled. It calls Restart(B) to resume the first restart attempt. This sets B.DesiredState = SHUTDOWN (which means no change) and creates a new task. We'd end up with an extra task. Again, this would be almost impossible to trigger, but I'm fixing it for the sake of correctness. The general principle here is that Restart should never been called on a task that already has DesiredState > RUNNING, since what Restart does is "shut down this task and replace it with a new one". I also added a sanity check to Restart. I'm not sure this is really helpful because returning an error probably does more harm than good in this case. But at least it would cause an error message to be logged. Signed-off-by: Aaron Lehmann <[email protected]>
Current coverage is 54.85% (diff: 0.00%)@@ master #1305 diff @@
==========================================
Files 78 78
Lines 12466 12470 +4
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
- Hits 6867 6841 -26
- Misses 4659 4677 +18
- Partials 940 952 +12
|
LGTM |
There's another way this could be triggered:
I think this explains moby/moby#27382 We should probably backport this to 1.12.3. |
@aaronlehmann thanks for looking into this; can you open a tracking issue? I think it'll depend on the amount of risk / code change to backport it |
Very low risk - take a look at the code changes. It adds some sanity checks that were missing before. Can we use moby/moby#27382 as the tracking issue? |
@aaronlehmann 😊 didn't look at the code, yes, that looks fairly low risk I added moby/moby#27382 to the 1.12.3 milestone for docker |
When I was debugging integration test failures a week ago, I dug deep
into the recent restart supervisor changes, and noticed a possible race
which is almost impossible in practice.
This is the sequence of events that has to happen:
Restart task A -> new task B.
B fails before the restart delay has elapsed (pull failure):
Restart(B) starts waiter goroutine.
B's node fails at exact moment after the restart delay finishes. The
orchestrator calls Restart(B) again, this time there's no delay in
progress so Restart proceeds to set B.DesiredState = SHUTDOWN and create
a new task.
The waiter goroutine from step 2 is scheduled. It calls Restart(B) to
resume the first restart attempt. This sets B.DesiredState = SHUTDOWN
(which means no change) and creates a new task.
We'd end up with an extra task.
Again, this would be almost impossible to trigger, but I'm fixing it for
the sake of correctness. The general principle here is that Restart
should never been called on a task that already has DesiredState >
RUNNING, since what Restart does is "shut down this task and replace it
with a new one".
I also added a sanity check to Restart. I'm not sure this is really
helpful because returning an error probably does more harm than good in
this case. But at least it would cause an error message to be logged.
cc @dongluochen