proposal: automatic resource controls #2108
Comments
I've noticed that io/network is more likely to cause congestion than CPU or memory - especially with all the file-syncing that typically goes on. I feel like that's the more important bottleneck to address. In any case, this feels like the right direction and will be especially useful for managing multitenancy.
@vladaionescu Do you mean more the pulls/pushes being too parallel rather than the io bottlenecks in exec? I had #1989 separately for that, which really improved some giant builds but unfortunately introduced a deadlock that needs to be figured out before implementation, so it was reverted.
As you mentioned above, the OS-level scheduler generally does a good job managing CPU contention. I can see why we might want to delay starting ops based on load, since trying to run too many ops at once could lead to suboptimal performance. But I'm curious about the rationale for pausing ops when CPU load gets high - do you think BuildKit would be able to make good decisions about what to pause and when to pause it, and would they be better overall than the scheduler's timeslicing?
I do think that 100% CPU usage is not something we should avoid. On the contrary, the builder should maximize the time the CPU is fully utilized. But I think there are some extreme cases where it might be better for us to pause. Eg. conventional wisdom tells that `make` should be started with a job count tied to the number of CPUs.

Startup delays would be enough, except that processes (or groups of them) do not have constant CPU usage. Therefore there might be no load initially that picks up later. I agree that we do need to be careful with these pauses. I don't want a flickering experience where a process flips between running and paused all the time. I guess some practical cases would tell for sure how effective pausing would be. I think it at least makes sense to test it.
Yes, exactly - that kind of limiting is really helpful for really large builds. I've been noticing random timeouts caused by even simple commands.
I've been catching up on cgroupv2/PSI, and it looks quite related.
Hey guys, any plans to continue development of this? What @tonistiigi is describing here is really nice but complex; even simple bookkeeping of RAM, using something like a `MemoryHint` on a RUN command, would make a world of difference. This is especially painful because any external constraint does not know whether the RUN is cached or not... So maybe it makes sense to take small steps here and implement something relatively simple?
After giving this a bit more thought and looking more into cgroupv2/PSI, I might suggest this kind of simple implementation for a resource (say memory):
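A rough sketch of what that simple bookkeeping could look like - each op reserves its hint up front and blocks until the budget has room. All names and types here are my own assumptions, not an existing BuildKit API:

```go
package main

import "sync"

// memoryGate is a hypothetical bookkeeping structure: each op declares a
// memory hint up front, and Acquire blocks until the sum of outstanding
// hints fits under maxMemory. No cgroups or monitoring involved.
type memoryGate struct {
	mu        sync.Mutex
	cond      *sync.Cond
	maxMemory uint64 // total budget, bytes
	reserved  uint64 // sum of hints for currently running ops
}

func newMemoryGate(maxMemory uint64) *memoryGate {
	g := &memoryGate{maxMemory: maxMemory}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// Acquire blocks until hint bytes can be reserved.
func (g *memoryGate) Acquire(hint uint64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for g.reserved+hint > g.maxMemory {
		g.cond.Wait()
	}
	g.reserved += hint
}

// Release returns the reservation and wakes up blocked ops.
func (g *memoryGate) Release(hint uint64) {
	g.mu.Lock()
	g.reserved -= hint
	g.mu.Unlock()
	g.cond.Broadcast()
}
```

This ignores cached RUNs and actual usage entirely, which is why it stays simple: the hint is trusted as-is.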
#2049 is adding a simple parallelization limit. This proposal describes a more complex follow-up to this problem. The goal is that when a build is scaled up (eg. take a build that compiles an app now and change it to a build that compiles the same app from 100 different commits), it does not cause the builder to crash and the machine to become unresponsive (or catch fire). This should happen without requiring manual configuration. If you move from a high-powered machine to a low-powered one (eg. an rpi), the build should slow down linearly, without additional bottlenecks from inefficient execution.
In the simplest form, the scheduler should be combined with system state monitoring. When the monitor detects that the machine's resource limits have been reached, it starts blocking new jobs. In some cases, existing jobs may need to be paused.
This is different from cgroup controls that may be applied independently.
I'm only concentrating on CPU and memory. In the future, this could be extended to io/network. It's also probably too early to discuss making initial predictions of likely resource usage based on command arguments.
New daemon config values:

- `MaxMemory` - defaults to, and is capped at, the currently available memory
- `MemoryBuffer` - a percentage of `MaxMemory`
- `NumCPUs` ? maybe, but not very related
New values for `ExecOp`:

- `MemoryHint` - how much memory the process expects to need
- `MemoryLimit` - limit memory under these bounds (with a cgroup)
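For illustration, the proposed knobs could be grouped roughly like this. The field names follow the proposal; the types and the `effectiveBudget` helper are assumptions:

```go
package main

// DaemonResourceConfig sketches the proposed daemon-level settings.
type DaemonResourceConfig struct {
	MaxMemory    uint64 // bytes; defaults to / capped at currently available memory
	MemoryBuffer int    // percentage of MaxMemory kept free so running ops can grow
	NumCPUs      int    // optional, not strictly related
}

// ExecOpResourceHints sketches the proposed per-ExecOp values.
type ExecOpResourceHints struct {
	MemoryHint  uint64 // expected peak usage; improves the initial prediction
	MemoryLimit uint64 // hard cgroup limit; upper cap for the prediction
}

// effectiveBudget is a hypothetical helper: the memory the scheduler may
// hand out to ops, i.e. MaxMemory minus the growth buffer.
func (c DaemonResourceConfig) effectiveBudget() uint64 {
	return c.MaxMemory - c.MaxMemory*uint64(c.MemoryBuffer)/100
}
```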
Every `Op` is initialized with a shared `ResourceManager` instance. The solver/scheduler has no knowledge of the limits and only calls the op's `Acquire()` method, which blocks until the op can start (or is canceled). This is important for the case where we have multiple workers: 2 vertexes on different workers have different resource managers and don't block each other. Also, the solver/vertex definition is very generic and doesn't fit with very specific Linux resources. Hopefully, this doesn't limit us from making smarter scheduling decisions.

The `Acquire()` method registers the current ID with the `ResourceManager`. If the system is exhausted, the function blocks. During `Exec`, the op monitors its own resources (the cgroups of the containers it created) and calls `rm.Update()`. The update call may choose to return a channel. If that happens, the op should pause its execution (eg. with the freezer cgroup) and wait for that channel to return.

The `ResourceManager` compares system stats with the values sent by the ops and decides when to block certain ops when they call `Init`/`Update`, and when to restart them. The `ResourceManager` should be unit-testable with a custom system stats provider implementation.
CPU

The main parameter to monitor here is whether the system CPU is exhausted. This can be determined based on `vmstat` (or its /proc equivalent) by checking the length of the run queue compared to the number of CPUs, and the CPU idle time. If the CPU is exhausted, then additional ops can't run. Values should be determined as a weighted average over a time period, to minimize wrong decisions from quick changes. At the least, historical values should be taken into account when determining if the CPU is free again. When the CPU is exhausted, starting new ops can be blocked without historical data. When one op has finished, the algorithm should be smart enough to understand that its CPU time is no longer in use.

The second problem that should be avoided is starting too many ops in parallel when the stats are low, and then, as they start up, they exhaust the CPU. This should be handled by introducing delays if too many processes start at the same time. To predict the delays, I think we need to look at system load as well as the count/speed of the CPUs. Eg. we should be able to detect that an RPi needs longer delays. One way to think about this problem is that we set a prediction of CPU usage on every op we start. Initially, this prediction has a big standard deviation. Over time, when we get actual values via `Update()`, that stddev gets smaller, and we can trust the data if it says that there is more CPU power left.

Generally, over-using the CPU is not as big of a problem as doing the same with memory. The kernel's scheduler can balance quite well, and more parallel ops usually give faster build times. So we should not try to be very precise, but avoid extreme cases.
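The run-queue check can be sketched against a raw stats string, which keeps it unit-testable with a custom stats provider, as the proposal suggests. `procs_running` in `/proc/stat` is the runnable-task count that `vmstat` reports in its `r` column; the single-sample comparison below is an assumption standing in for the weighted average described above:

```go
package main

import (
	"strconv"
	"strings"
)

// cpuExhausted reports whether the run queue is longer than the number of
// CPUs, given the raw contents of /proc/stat. A real implementation would
// average this over a time window instead of using one sample.
func cpuExhausted(procStat string, numCPU int) bool {
	for _, line := range strings.Split(procStat, "\n") {
		if strings.HasPrefix(line, "procs_running ") {
			n, err := strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "procs_running ")))
			if err != nil {
				return false
			}
			// More runnable tasks than CPUs means the run queue is backed up.
			return n > numCPU
		}
	}
	return false
}
```

Taking the stats as a string (rather than reading `/proc/stat` directly) is what makes the custom-provider unit testing trivial.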
Although rare, I do think we need some logic to also pause ops when needed. For example, let's say there are a lot of processes running `./configure && make`. Configure is usually quite sequential, so CPU load will not be detected. It also takes a long time, so the startup delays have no effect. But `make` can likely take advantage of the whole CPU and create a big run queue. So if the run queue remains very long for a long time, `Update` should send a signal to an op to pause it.

We can also look into scheduling priorities to give some ops more CPU than others.
Memory
Memory monitoring is somewhat similar, but we need to be more precise. The manager is configured with the `MaxMemory` parameter, capped by the maximum free system memory, and the goal is for the ops' total memory usage to never go past that value.

Unlike CPU, once we pause an op, it does not release the memory it already uses (at least without checkpoint/restore, which is out of scope atm). This means that we need a buffer to allow memory to grow, and that we need to predict how much memory an op will take in the future.

For a better prediction, an op can give `MemoryHint` and `MemoryLimit` values with its definition. The hint estimates how much memory will be needed, to avoid overflow when it is known that a process uses lots of resources. The limit sets a cgroup limit and can be used as an upper cap for the prediction.

If no hint was set, the prediction starts with a value based on a constant and some average of memory usage from previous builds (later, args could be used for better historic prediction). Initially, the prediction has a high standard deviation.
Once the process has started, it sends updates about its memory usage. This data can be used for future predictions based on the changes in previous data. As we get more data, we can trust it more, and the stddev gets smaller.
Examples:

- Initial prediction: 200MB
  Process starts, takes 30MB instantly, 31MB in 10s, 32MB in 20s
  Prediction: 200MB; 5s 100MB; 10s 40MB
- Initial prediction: 200MB
  Process starts, takes 100MB instantly, 200MB in 10s, 300MB in 20s, 320MB in 30s, 325MB in 40s
  Prediction: 200MB; 10s 300MB; 20s 1GB; 30s 500MB; 40s 400MB
It's unclear how far into the future the prediction should go. We probably need a lot of tunable parameters to determine the best values.
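A toy predictor in the spirit of the examples above: extrapolate the most recent growth and add a safety margin that shrinks as more samples arrive, standing in for the shrinking standard deviation. The function shape and all constants are assumptions, including the crude 1/n margin:

```go
package main

// predictMemory estimates future memory use from observed samples (bytes,
// taken at a fixed interval). With no data it trusts the initial
// prediction/hint; otherwise it extrapolates the last growth step over the
// horizon and adds a margin of estimate/len(samples), so the margin starts
// at 100% of the estimate after one sample and shrinks as data accumulates.
func predictMemory(samples []uint64, horizonSteps int, initial uint64) uint64 {
	if len(samples) == 0 {
		return initial
	}
	last := samples[len(samples)-1]
	var growth uint64
	if n := len(samples); n >= 2 && last > samples[n-2] {
		growth = last - samples[n-2]
	}
	estimate := last + growth*uint64(horizonSteps)
	margin := estimate / uint64(len(samples))
	return estimate + margin
}
```

This reproduces the qualitative behavior from the examples: a flat 30/31/32MB process quickly drops far below its 200MB initial prediction, while a fast-growing one overshoots its current usage.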
All the ops' memory predictions are added together and compared with the maximum available memory. A buffer is also applied, to allow the processes that are left running to grow their memory.
If the prediction shows that the memory limit is about to be reached, one of the ops is paused. It probably makes sense to pause the op that has been acquiring new memory most aggressively.
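The pause decision from the previous two paragraphs can be sketched as a pure function: sum the predictions, compare against the budget plus the growth buffer, and pick the fastest-growing op as the victim. Type and field names are hypothetical:

```go
package main

// opPrediction is a hypothetical per-op summary fed to the decision.
type opPrediction struct {
	ID         string
	Predicted  uint64 // predicted future usage, bytes
	GrowthRate uint64 // recent growth, bytes per interval
}

// pickOpToPause returns the ID of the op to pause when the summed
// predictions plus the growth buffer exceed maxMemory, preferring the op
// acquiring memory most aggressively. "" means nothing needs pausing.
func pickOpToPause(ops []opPrediction, maxMemory, buffer uint64) string {
	var total uint64
	for _, o := range ops {
		total += o.Predicted
	}
	if total+buffer <= maxMemory {
		return ""
	}
	victim := ""
	var worst uint64
	for _, o := range ops {
		if o.GrowthRate >= worst {
			worst = o.GrowthRate
			victim = o.ID
		}
	}
	return victim
}
```

Pausing the fastest grower frees the most future headroom per pause, which matters since (without checkpoint/restore) a paused op keeps the memory it already holds.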
When an op pauses, we need a way for this to show up in the progress bar. This should be solvable with a new state in the `Vertex` status structure. This part could be done as a separate step, and possibly first, as there may be a backward-compatibility problem with old clients.

As another follow-up, the resource manager should be able to return debug info. With analysis of that info, it should be possible to predict whether a build needs more CPU, iops, memory, etc., and how much faster it would have been on a machine with different capabilities.
@vladaionescu @AkihiroSuda @hinshun @aaronlehmann @crazy-max