Fix 1.21 regression: GET_32G_MAX_CONCURRENT + mixed prepared/executing leads to stuck scheduler #10633
Great catch!
I believe this is not counting quite correctly, but the fix looks good otherwise.
 tc.lk.Lock()
 defer tc.lk.Unlock()
-tc.taskCounters[tt]--
+delete(tc.getUnlocked(tt), schedID)
So I believe this has a bug which makes it not work as we'd expect:
- First we call .Add for preparing
- Then we call .Add on active in .withResources
- In .withResources we call .Free on preparing
- Now the task runs - and we don't count the resources
- Then we call .Free on preparing
This can be fixed by swapping map[sealtasks.SealTaskType]map[uuid.UUID]bool for map[sealtasks.SealTaskType]map[uuid.UUID]int and counting how many times Add/Free was called.
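A minimal sketch of the counting approach suggested here, using a simplified stand-in for the scheduler's counter. The names taskCounter, Add, Free and Active, and the plain strings used for task type and scheduler ID, are illustrative assumptions and not the actual Lotus types. Each Add/Free pair is reference-counted per scheduler ID, and the entry is dropped once its count reaches zero, so the number of map entries keeps matching the number of distinct tasks.

```go
package main

import (
	"fmt"
	"sync"
)

// taskCounter is a hypothetical, simplified counter:
// task type -> scheduler ID -> number of outstanding Add calls.
type taskCounter struct {
	lk     sync.Mutex
	counts map[string]map[string]int
}

func newTaskCounter() *taskCounter {
	return &taskCounter{counts: map[string]map[string]int{}}
}

// Add records one more holder (preparing or active) for schedID.
func (tc *taskCounter) Add(tt, schedID string) {
	tc.lk.Lock()
	defer tc.lk.Unlock()
	if tc.counts[tt] == nil {
		tc.counts[tt] = map[string]int{}
	}
	tc.counts[tt][schedID]++
}

// Free releases one holder and deletes the entry when none remain.
func (tc *taskCounter) Free(tt, schedID string) {
	tc.lk.Lock()
	defer tc.lk.Unlock()
	m := tc.counts[tt]
	if m[schedID] <= 1 {
		delete(m, schedID) // no-op if the entry (or the map) does not exist
		return
	}
	m[schedID]--
}

// Active returns how many distinct scheduler IDs are counted for tt.
func (tc *taskCounter) Active(tt string) int {
	tc.lk.Lock()
	defer tc.lk.Unlock()
	return len(tc.counts[tt])
}

func main() {
	tc := newTaskCounter()
	tc.Add("seal/v0/fetch", "task-1")  // counted as preparing
	tc.Add("seal/v0/fetch", "task-1")  // counted as active
	tc.Free("seal/v0/fetch", "task-1") // preparing released, task still running
	fmt.Println(tc.Active("seal/v0/fetch"))
	tc.Free("seal/v0/fetch", "task-1") // active released
	fmt.Println(tc.Active("seal/v0/fetch"))
}
```

With a boolean set instead of the integer, the first Free would already remove the entry and the running task would no longer count towards the limit.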
Yeah, it's allowing more than it should as-is.
Trying that and counting +1/-1 for each Add/Free invocation leads to the wrong, stuck behavior again:
ID Sector Worker Hostname Task State Time
00000000 23 4154622e x GET assigned(1) 21.2s
00000000 24 4154622e x GET assigned(2) 21.2s
That's interesting, since the only reasonable idea I have is that there's an invocation missing somewhere.
@magik6k
Apparently the issue is that the last Free for a SchedId happens after the last WINDOW assignPreparingWork (which in turn says "not scheduling on worker" for startPreparing).
A quick test of returning update=true in schedWorker::waitForUpdates, to force a regular invocation of that, leads to the "GET backlog" getting fully resolved over time as well.
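The ordering problem described here can be illustrated with a small deterministic toy model. This is not Lotus code: sim, schedule, finish and run are hypothetical names, and the "notification" is modelled as a direct call to schedule. The only thing that differs between the two runs is whether the scheduler is woken before or after the finished task's slot is freed.

```go
package main

import "fmt"

// sim is a toy scheduler with a concurrency limit and a backlog of queued tasks.
type sim struct {
	limit   int
	running int
	queue   int
	done    int
}

// schedule starts queued tasks while the limit allows it.
func (s *sim) schedule() {
	for s.queue > 0 && s.running < s.limit {
		s.queue--
		s.running++
	}
}

// finish completes one running task. If notifyFirst is true, the scheduler
// is woken while the task still counts towards the limit.
func (s *sim) finish(notifyFirst bool) {
	if notifyFirst {
		s.schedule() // sees the stale count, starts nothing
		s.running--
	} else {
		s.running--
		s.schedule() // sees the freed slot
	}
	s.done++
}

// run processes a backlog with limit 1 and returns how many tasks completed
// before the scheduler ran out of wake-ups.
func run(notifyFirst bool, tasks int) int {
	s := &sim{limit: 1, queue: tasks}
	s.schedule()
	for s.running > 0 {
		s.finish(notifyFirst)
	}
	return s.done
}

func main() {
	fmt.Println(run(true, 12))  // notify before Free: backlog stays stuck
	fmt.Println(run(false, 12)) // notify after Free: backlog drains
}
```

In this model, forcing periodic wake-ups would also drain the backlog eventually, which matches the update=true experiment, but sending the notification after the slot is freed avoids the stall without polling.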
@magik6k Any update on this? This is now broken in 1.21, 1.22 and 1.23.
only reasonable idea I have is that there's an invocation missing somewhere.
How did you implement Free with the counter? Did you delete from the map when the counter reached zero? If not, the tasks := a.taskCounters.Get(tt) [...] if len(tasks) >= needRes.MaxConcurrent check below would not work correctly, as it would still count the zero entries.
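A short sketch of the pitfall raised in this question, assuming the limit check compares the number of map entries against the limit as in the quoted snippet. The helper names atLimit, freeKeepZero and freeAndDelete are hypothetical.

```go
package main

import "fmt"

// atLimit has the same shape as the quoted check: it compares the number
// of map entries, not the sum of the counters, against the limit.
func atLimit(tasks map[string]int, maxConcurrent int) bool {
	return len(tasks) >= maxConcurrent
}

// freeKeepZero decrements the counter but leaves a zero-valued entry behind.
func freeKeepZero(tasks map[string]int, id string) {
	tasks[id]--
}

// freeAndDelete decrements the counter and removes the entry at zero.
func freeAndDelete(tasks map[string]int, id string) {
	tasks[id]--
	if tasks[id] <= 0 {
		delete(tasks, id)
	}
}

func main() {
	a := map[string]int{"task-1": 1}
	freeKeepZero(a, "task-1")
	fmt.Println(atLimit(a, 1)) // the zero entry still counts towards len

	b := map[string]int{"task-1": 1}
	freeAndDelete(b, "task-1")
	fmt.Println(atLimit(b, 1)) // the slot is free again
}
```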
@magik6k Yeah, I've accounted for that:
steffengy@a3b7ec2
(I think otherwise it also wouldn't work when making the "scheduler entry" happen more often by returning update=true in schedWorker::waitForUpdates, as that just evaluates the same state at a later time. So it's a timing/ordering issue: the last Free happens after the next scheduler entry for startPreparing, and things stay stuck there until the next scheduling happens.)
Great! Yeah, in a quick test that looks to be working, and fix-wise it seems plausible that that was the issue.
Closing this PR in favor of yours.
@@ -416,7 +416,7 @@ assignLoop:
 }

 needRes := worker.Info.Resources.ResourceSpec(todo.Sector.ProofType, todo.TaskType)
-if worker.active.CanHandleRequest(todo.SealTask(), needRes, sw.wid, "startPreparing", worker.Info) {
+if worker.active.CanHandleRequest(todo.SchedId, todo.SealTask(), needRes, sw.wid, "startPreparing", worker.Info) {
Not related to this PR, but this should be:
-if worker.active.CanHandleRequest(todo.SchedId, todo.SealTask(), needRes, sw.wid, "startPreparing", worker.Info) {
+if worker.active.CanHandleRequest(todo.SchedId, todo.SealTask(), needRes, sw.wid, "startReady", worker.Info) {
The Lotus Team is actively working on this issue and will provide an update ASAP! 🙏
* Fix 1.21 regression: GET_32G_MAX_CONCURRENT + mixed prepared/executing leads to stuck scheduler

  If you have 12 GET tasks and GET_32G_MAX_CONCURRENT=1, sealing jobs will only show assigned tasks for GET of the miner and is stuck. I believe this to be a regression of 1.21 unifying the counters, in the case of GETs where PrepType and TaskType both being seal/v0/fetch leading to a state where tasks are blocked since already counted towards the limit.
* itests: Repro issue from PR #10633
* make counters int (non-working)
* fix: worker sched: Send taskDone notifs after tasks are done
* itests: Make TestPledgeMaxConcurrentGet actually reproduce the issue
* make the linter happy

---------

Co-authored-by: Steffen Butzer <[email protected]>
!!!!
Superseded by PR #10850
!!!!
If you have 12 GET tasks and GET_32G_MAX_CONCURRENT=1, sealing jobs only shows GET tasks in the assigned state for the miner, and the scheduler is stuck.
I believe this to be a regression from 1.21 unifying the counters: for GETs, PrepType and TaskType are both seal/v0/fetch, which leads to a state where tasks are blocked because they are already counted towards the limit.
So while #9407 is now enforced, the overall state is broken.
This is more of a draft PR that I've confirmed seems to work; if you have a better way to fix this, let me know.
This works by not counting the same SchedId twice towards the limit.
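A rough sketch of this idea, under the assumption that the counter is a set of scheduler IDs per task type and that the limit check lets an already-counted ID through. The type taskSet and its methods add and canHandle are hypothetical and only mirror the shape of the change, not the actual Lotus implementation.

```go
package main

import "fmt"

// taskSet is a hypothetical set of scheduler IDs per task type.
type taskSet map[string]map[string]bool

// add marks schedID as counted; adding the same ID twice has no extra effect.
func (s taskSet) add(tt, schedID string) {
	if s[tt] == nil {
		s[tt] = map[string]bool{}
	}
	s[tt][schedID] = true
}

// canHandle lets a task through if it is already counted, or if there is
// still room under the limit for a new scheduler ID.
func (s taskSet) canHandle(tt, schedID string, maxConcurrent int) bool {
	if s[tt][schedID] {
		return true // same SchedId, do not count it a second time
	}
	return len(s[tt]) < maxConcurrent
}

func main() {
	s := taskSet{}
	s.add("seal/v0/fetch", "task-1") // counted while preparing

	// The same task moving from preparing to executing is not blocked
	// by its own entry.
	fmt.Println(s.canHandle("seal/v0/fetch", "task-1", 1))
	// A different task has to wait for the slot.
	fmt.Println(s.canHandle("seal/v0/fetch", "task-2", 1))
}
```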
@magik6k @rjan90
While, as seen in the discussion below, the patch isn't exactly 100% right yet, it's closer to the previous behavior and at least doesn't lead to a scheduler that is stuck or not working at all.