Scheduler enhancements #7703
Codecov Report
@@ Coverage Diff @@
## master #7703 +/- ##
==========================================
+ Coverage 38.83% 39.56% +0.72%
==========================================
Files 638 640 +2
Lines 68122 68325 +203
==========================================
+ Hits 26456 27031 +575
+ Misses 37126 36654 -472
- Partials 4540 4640 +100
@@ -58,7 +58,7 @@ var (
 	FullAPIVersion1 = newVer(2, 1, 0)

 	MinerAPIVersion0 = newVer(1, 2, 0)
-	WorkerAPIVersion0 = newVer(1, 1, 0)
+	WorkerAPIVersion0 = newVer(1, 5, 0)
why jump so big
Testing with existing miners. Should be ok to keep like this.
2368358 to 04c016d
Worker processes may have memory limits imposed by systemd, but /proc/meminfo reports the entire system memory regardless of those limits. As a result the scheduler believes the worker has all of the system memory available and allocates it too many tasks. This change attempts to read the cgroup memory limits for the worker process. It supports cgroups v1 and v2, compares the cgroup limits against the system memory, and returns the most conservative values, so that the worker is not allocated too many tasks and does not trigger an OOM event.
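The "most conservative value" rule can be sketched as follows. This is a minimal illustration, not the code in this PR: `parseCgroupV2Limit` and `effectiveMemory` are hypothetical names, and the sketch assumes the cgroup v2 `memory.max` format, where the file holds either a byte count or the literal `max`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCgroupV2Limit parses the contents of a cgroup v2 memory.max file.
// The file holds either a byte count or the literal "max" (no limit).
// Hypothetical helper, not taken from the PR.
func parseCgroupV2Limit(raw string) (limit uint64, limited bool, err error) {
	s := strings.TrimSpace(raw)
	if s == "max" {
		return 0, false, nil
	}
	v, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, false, err
	}
	return v, true, nil
}

// effectiveMemory returns the smaller of the system total and the cgroup
// limit, so the scheduler never assumes more memory than the worker can use.
func effectiveMemory(systemTotal uint64, cgroupRaw string) (uint64, error) {
	limit, limited, err := parseCgroupV2Limit(cgroupRaw)
	if err != nil {
		return 0, err
	}
	if limited && limit < systemTotal {
		return limit, nil
	}
	return systemTotal, nil
}

func main() {
	total := uint64(64 << 30) // pretend /proc/meminfo reported 64 GiB
	for _, raw := range []string{"max\n", "34359738368\n"} {
		mem, err := effectiveMemory(total, raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("memory.max=%q -> %d bytes usable\n", strings.TrimSpace(raw), mem)
	}
}
```

A real implementation would read the file from the worker's own cgroup directory and handle cgroups v1 separately; the sketch only shows the comparison step.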
Attempting to report "memory used by other processes" in the MemReserved field fails to account for the fact that the system's used memory includes memory used by ongoing tasks. To account for this properly, the worker should report the memory and swap used; the scheduler, which knows the memory requirements of each task, can then determine whether there is sufficient memory available for a task.
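A rough sketch of that check, under stated assumptions: the struct and function names below are hypothetical, the field names only mirror the ones visible in the diff (`MemUsed`, `MemSwapUsed`, `MaxMemory`, `BaseMinMemory`), and the exact comparison in the PR may differ.

```go
package main

import "fmt"

// workerResources mirrors the fields a worker reports (hypothetical type).
type workerResources struct {
	MemPhysical, MemSwap uint64 // totals available to the worker
	MemUsed, MemSwapUsed uint64 // usage reported by the worker
}

// taskNeed holds the memory requirements of one task (hypothetical type).
type taskNeed struct {
	MaxMemory, BaseMinMemory uint64
}

// canFit reports whether a task fits. It takes the larger of the scheduler's
// own accounting for running tasks and the usage the worker reports, so that
// memory used by other processes is counted without double counting tasks.
func canFit(res workerResources, scheduledUsed uint64, need taskNeed) bool {
	reportedUsed := res.MemUsed + res.MemSwapUsed
	used := scheduledUsed
	if used < reportedUsed {
		used = reportedUsed
	}
	needed := need.MaxMemory + need.BaseMinMemory
	return used+needed <= res.MemPhysical+res.MemSwap
}

func main() {
	res := workerResources{MemPhysical: 100, MemSwap: 20, MemUsed: 50, MemSwapUsed: 10}
	fmt.Println(canFit(res, 30, taskNeed{MaxMemory: 50, BaseMinMemory: 10}))
}
```

Naming the reported total (`reportedUsed` here) also makes the comparison easier to follow.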
Before this change, a worker could only be allocated one GPU task, regardless of how much of the GPU that task uses or how many GPUs are in the system. This change makes GPUUtilization a float, which can express that a task needs a fraction of a GPU or multiple GPUs. GPUs are accounted for like RAM and CPUs, so workers with more GPUs can be allocated more tasks. A known issue is that PC2 cannot use multiple GPUs: even if the worker has multiple GPUs and is allocated multiple PC2 tasks, those tasks will only run on the first GPU. This could result in unexpected behavior when a worker with multiple GPUs is assigned multiple PC2 tasks. It should not surprise existing users who upgrade, since anyone running workers with multiple GPUs should already know this and be running one worker per GPU for PC2. Those users now have the freedom to set the GPU utilization of PC2 to less than one and effectively run multiple PC2 processes in a single worker. C2 is capable of utilizing multiple GPUs, and workers can now be customized for C2 accordingly.
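Fractional GPU accounting of this kind can be sketched like this. The `gpuPool` type and its method are hypothetical and only illustrate the idea of treating GPUs as a divisible resource.

```go
package main

import "fmt"

// gpuPool tracks GPU capacity as a float so a task can claim a fraction of
// a GPU or several GPUs. Hypothetical type, not taken from the PR.
type gpuPool struct {
	total float64 // number of GPUs on the worker
	used  float64 // sum of utilization of the tasks already assigned
}

// tryAllocate reserves `need` GPUs if they are available and reports
// whether the reservation succeeded. Tasks that need no GPU always succeed.
func (p *gpuPool) tryAllocate(need float64) bool {
	if need == 0 {
		return true
	}
	if p.used+need > p.total {
		return false
	}
	p.used += need
	return true
}

func main() {
	pool := &gpuPool{total: 2}
	// Two half-GPU tasks, one whole-GPU task, then one more half-GPU task.
	for _, need := range []float64{0.5, 0.5, 1, 0.5} {
		fmt.Printf("need %.1f -> allocated: %v\n", need, pool.tryAllocate(need))
	}
}
```

Utilization values that are not exactly representable in binary (0.1, for example) can accumulate rounding error, so a production version may want a small tolerance in the comparison.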
In an environment with heterogeneous worker nodes, a universal resource table for all workers does not allow effective scheduling of tasks. Some workers may have different proof cache settings, which changes the memory required for different tasks. Some workers may have a different count of CPUs per core complex, which changes the maximum parallelism of PC1. This change allows workers to customize these parameters with environment variables. For example, a worker could set the environment variable PC1_MIN_MEMORY to customize the minimum memory requirement for PC1 tasks. If no environment variables are specified, the resource table on the miner is used, except for PC1 parallelism: if PC1_MAX_PARALLELISM is not specified and FIL_PROOFS_USE_MULTICORE_SDR is set, PC1_MAX_PARALLELISM is automatically set to FIL_PROOFS_MULTICORE_SDR_PRODUCERS + 1.
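The PC1 parallelism rule described above could look roughly like this. `pc1MaxParallelism` is a hypothetical helper; treating "set" as "present in the environment" and falling back to the table default when the producer count is absent are both assumptions of this sketch.

```go
package main

import (
	"fmt"
	"strconv"
)

// lookupFunc has the same shape as os.LookupEnv, which makes the logic easy
// to exercise with a map instead of the real environment.
type lookupFunc func(key string) (string, bool)

// pc1MaxParallelism resolves the PC1 parallelism limit:
//  1. an explicit PC1_MAX_PARALLELISM wins;
//  2. otherwise, with multicore SDR enabled, use producers + 1;
//  3. otherwise keep the default from the resource table.
func pc1MaxParallelism(lookup lookupFunc, tableDefault int) (int, error) {
	if v, ok := lookup("PC1_MAX_PARALLELISM"); ok {
		return strconv.Atoi(v)
	}
	if _, ok := lookup("FIL_PROOFS_USE_MULTICORE_SDR"); ok {
		if p, ok := lookup("FIL_PROOFS_MULTICORE_SDR_PRODUCERS"); ok {
			n, err := strconv.Atoi(p)
			if err != nil {
				return 0, err
			}
			return n + 1, nil
		}
	}
	return tableDefault, nil
}

func main() {
	env := map[string]string{
		"FIL_PROOFS_USE_MULTICORE_SDR":       "1",
		"FIL_PROOFS_MULTICORE_SDR_PRODUCERS": "3",
	}
	lookup := func(k string) (string, bool) { v, ok := env[k]; return v, ok }
	n, err := pc1MaxParallelism(lookup, 1)
	fmt.Println(n, err)
}
```

Passing `os.LookupEnv` as the lookup function would apply the same rule to the real process environment.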
Co-authored-by: Aayush Rajasekaran <[email protected]>
04c016d to 330cfc3
(rebased on latest master)
This is all new to me and some of the scheduling logic (canHandleRequest for example) is still a little hard for me to follow. But overall I understand what you are doing and this looks good.
	return storiface.WorkerInfo{
		Hostname: "testworkerer",
		Resources: storiface.WorkerResources{
			MemPhysical: res.MinMemory * 3,
			MemUsed:     res.MinMemory,
Should MemSwapUsed be added too?
)

func cgroupV2MountPoint() (string, error) {
	f, err := os.Open("/proc/self/mountinfo")
Do we expect lotus will have permission to open this path? I guess users of cgroups will set up these directories to have the right permissions? People using this feature probably know what they're doing but maybe worth calling out in documentation?
Normal users have access to this by default
	}
	defer f.Close() //nolint

	scanner := bufio.NewScanner(f)
Kinda overkill but consider parsing with something nice
		return 0, 0, 0, 0, err
	}

	for path != "/" {
I know almost nothing about cgroups, but this loop structure confuses me. Why doesn't the output of cgroupv2.PidGroupPath already give the correct path for getting memory limit information?
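One plausible reading of the loop, offered as an assumption rather than the author's explanation: in cgroups v2 a limit set on an ancestor group also constrains its descendants, so the process's own group may carry no limit while a parent does. A sketch of walking toward the root and keeping the smallest limit found, with a map standing in for reading each group's files and `effectiveLimit` as a hypothetical name:

```go
package main

import (
	"fmt"
	"path"
)

// effectiveLimit walks from groupPath up to (but excluding) the root and
// returns the smallest limit configured on any group along the way.
// limits maps a group path to its configured limit; groups without an
// entry are treated as unlimited. Hypothetical helper, not the PR's code.
func effectiveLimit(groupPath string, limits map[string]uint64) (uint64, bool) {
	var lowest uint64
	found := false
	for p := path.Clean(groupPath); p != "/" && p != "."; p = path.Dir(p) {
		if l, ok := limits[p]; ok && (!found || l < lowest) {
			lowest, found = l, true
		}
	}
	return lowest, found
}

func main() {
	limits := map[string]uint64{
		"/system.slice": 8 << 30, // limit set on the parent slice only
	}
	l, ok := effectiveLimit("/system.slice/lotus-worker.service", limits)
	fmt.Println(l, ok)
}
```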
	defer cleanup()

	localTasks := []sealtasks.TaskType{
		sealtasks.TTAddPiece, sealtasks.TTPreCommit1, sealtasks.TTCommit1, sealtasks.TTFinalize, sealtasks.TTFetch,
nit: for more clarity you could restrict operations to what the test uses, TTAddPiece and TTFetch iiuc
			if w.MemUsedMax > 0 {
				break l
			}
			time.Sleep(time.Millisecond)
Any chance this could hang? Maybe adding a break signal on the AddPiece goroutine completing would guard against weird hangs if for some reason memory usage is not propagating to worker correctly?
It shouldn't (if it does for whatever reason, the test will just time out in 30min); (Normally this test takes 0.1s to run)
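For illustration only, the guard suggested in the review could take a shape like the following. `waitFor` is a hypothetical helper, not part of the test in this PR; it polls a condition but also stops when a done channel closes or a deadline passes.

```go
package main

import (
	"fmt"
	"time"
)

// waitFor polls cond roughly every millisecond. It returns true as soon as
// cond holds, and gives up when done is closed (for example because the
// goroutine under observation finished) or when timeout elapses.
func waitFor(cond func() bool, done <-chan struct{}, timeout time.Duration) bool {
	deadline := time.After(timeout)
	for {
		if cond() {
			return true
		}
		select {
		case <-done:
			// The observed work has finished; check one last time.
			return cond()
		case <-deadline:
			return false
		case <-time.After(time.Millisecond):
			// Poll again.
		}
	}
}

func main() {
	done := make(chan struct{})
	close(done) // simulate the observed goroutine finishing early
	ok := waitFor(func() bool { return false }, done, time.Second)
	fmt.Println("condition met:", ok)
}
```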
	maxNeedMem := res.MemReserved + a.memUsedMax + needRes.MaxMemory + needRes.BaseMinMemory
	vmemNeeded := needRes.MaxMemory + needRes.BaseMinMemory
	vmemUsed := a.memUsedMax
	if vmemUsed < res.MemUsed+res.MemSwapUsed {
nit: giving res.MemUsed + res.MemSwapUsed a name would make following along a bit easier
-	MinMemory uint64 // What Must be in RAM for decent perf
-	MaxMemory uint64 // Memory required (swap + ram)
+	MinMemory uint64 `envname:"MIN_MEMORY"` // What Must be in RAM for decent perf
+	MaxMemory uint64 `envname:"MAX_MEMORY"` // Memory required (swap + ram)
"What Must be in RAM for decent perf" makes sense to me, but "Memory required (swap + ram)" is confusing me a bit given that the name is MAX. It sounds like this is MIN_MEMORY + MIN_SWAP? I think clarifying this comment would be helpful.
	require.Equal(t, 1, ResourceTable[sealtasks.TTUnseal][stabi.RegisteredSealProof_StackedDrg2KiBV1_1].MaxParallelism)
}

func TestListResourceSDRMulticoreOverride(t *testing.T) {
Nice test
			envval, found := lookup(taskType.Short()+"_"+shortSize+"_"+envname, fmt.Sprint(rr.Elem().Field(i).Interface()))
			if !found {
				// special multicore SDR handling
				if (taskType == sealtasks.TTPreCommit1 || taskType == sealtasks.TTUnseal) && envname == "MAX_PARALLELISM" {
This isn't that bad but it makes me wonder if we are hoping to deprecate FIL_PROOFS_USE_MULTICORE_SDR. Other than supporting old workflows is there a reason to keep this old pattern around?
FIL_PROOFS_USE_MULTICORE_SDR is used internally inside proofs, and we don't have a better way to pass that into proofs right now.
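The `envname` struct tags and the reflection-based lookup in the hunk above suggest a pattern along these lines. This is a simplified sketch: the `resources` type and `applyOverrides` are hypothetical, the key is built as prefix plus tag (the hunk also includes a sector size component), and only `uint64` fields are handled.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// resources is a cut-down stand-in for one resource table entry.
type resources struct {
	MinMemory uint64 `envname:"MIN_MEMORY"`
	MaxMemory uint64 `envname:"MAX_MEMORY"`
}

// applyOverrides walks the struct fields, builds a key from prefix and the
// envname tag, and overwrites the field when the lookup finds a value.
func applyOverrides(r *resources, prefix string, lookup func(string) (string, bool)) error {
	v := reflect.ValueOf(r).Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		name := t.Field(i).Tag.Get("envname")
		if name == "" {
			continue // field is not configurable
		}
		raw, ok := lookup(prefix + "_" + name)
		if !ok {
			continue // keep the table default
		}
		n, err := strconv.ParseUint(raw, 10, 64)
		if err != nil {
			return fmt.Errorf("%s_%s: %w", prefix, name, err)
		}
		v.Field(i).SetUint(n)
	}
	return nil
}

func main() {
	env := map[string]string{"PC1_MIN_MEMORY": "1073741824"}
	lookup := func(k string) (string, bool) { v, ok := env[k]; return v, ok }

	r := resources{MinMemory: 1 << 20, MaxMemory: 2 << 20}
	if err := applyOverrides(&r, "PC1", lookup); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", r)
}
```

Driving the overrides from tags keeps the list of configurable parameters next to the field definitions, at the cost of reflection at startup.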
Address Scheduler enhancements (#7703) review
This is a rebased version of #7269 with some cleanup
TLDR:
lotus-worker resources --default
)