Improve support for `--cpus` to Docker CLI #23398

Conversation
/cc @sergiy-k
LGTM modulo the nit.

Actually, there is one more place where we use the PAL_GetCpuLimit - the GetCurrentProcessCpuCount in util.cpp. Can you please remove the call to it from there as well and then remove the function itself (its definition, its declaration in pal.h and also from mscordac_unixexports.src)?
Not relevant to this PR, but should we be surfacing that as part of System.Diagnostics.Process.ProcessorAffinity? Right now we just use sched_getaffinity:

Hmm, I didn't know about the System.Diagnostics.Process.ProcessorAffinity before. It is unfortunate that it represents the Windows way of expressing affinity, where a thread can be affinitized to at most 64 processors from the same processor group. On Unix, there is no limit like this; a thread can be affinitized to all processors on the system. So the Unix affinity cannot be represented correctly by a 64-bit mask for systems with more than 64 CPUs.

Yup. C'est la vie.
@stephentoub we use …

@janvorli I wouldn't want to remove the use of … Let's set up a few examples where …

Using … Per https://github.com/dotnet/coreclr/issues/22302#issuecomment-475115658, I can understand why we would want to keep the existing behavior at https://github.com/dotnet/coreclr/pull/23398/files#diff-0d2e4fca23d725bbd5e637d1fe57ed0fL346 to allow for the same kind of optimizations without relying on a new …

@luhenry it seems confusing to expose one processor count to the managed runtime / user code and a different one to the GC / threadpool. The managed user / runtime code can rely on the number of processors reported by the runtime for similar or the same purposes as the GC or threadpool. I would tend to think that we should expose the same processor count to both. Based on the recent discussion, I have switched my view of the limit applied by …
imo, we should look at this from the Kubernetes perspective, and not from the Docker CLI perspective. This is the Kubernetes documentation for managing compute resources: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/

If I assign a container 0.1 CPU on a 16-core machine, I'd rather have it run at 10% on 1 core than at 0.625% on 16. The kernel scheduler should be 'smart' about this.

I'll split off the non-controversial part of this PR [1] to another PR, just to get it in while we figure out this discussion here. [1] https://github.com/dotnet/coreclr/pull/23398/files#diff-4be22576babe49f6d25015c61d3f3ad2R105
Force-pushed "Environment.ProcessorCount when passing --cpus to Docker CLI" from 58443dc to 9d70701
@jkotas no particularly good reason for 2 different names. It seems to me Budget is a bit more explicit, but I'll change it to Limit for consistency's sake.
There is a lot of change in this PR. Instead, couldn't we change these lines to round up?

coreclr/src/pal/src/misc/cgroup.cpp, lines 103 to 110 in 9f285b7
To verify the beneficial effect of this change I ran the ASP.NET Core benchmarks with the following configurations and results:
We can observe a clear gain in RPS and a reduction in Avg Latency.
src/pal/src/misc/cgroup.cpp (Outdated)
```diff
- cpu_count = quota / period;
+ cpu_count = (double) quota / period;
```
// Cannot have less than 1 CPU
if (quota <= period)
{
*val = 1;
return true;
}
cpu_count = (double) quota / period;
if (cpu_count < UINT32_MAX)
{
// round up
*val = (uint32_t)(cpu_count + 0.999999999);
}
else
{
*val = UINT32_MAX;
}
maybe change this to:
*val = (UINT)(min((quota + period - 1)/period, UINT32_MAX))
```diff
-    if (PAL_GetCpuLimit(&cpuLimit) && cpuLimit < (uint32_t)processorCount)
-        processorCount = cpuLimit;
-#endif
-
     END_QCALL;
```
Do we want to remove this? Doesn't that change Environment.ProcessorCount to no longer take into account --cpus?
Yes, that's the goal of the change.
I haven't followed the full discussion, but one thing to confirm: …

@stephentoub yes, and that's after discussion with @VSadov at https://github.com/dotnet/coreclr/issues/22302#issuecomment-477386311 and performance measurements that confirm it is the better approach. @tmds we verified that the results are actually better with these changes (including the change to …)
Ok, good, thanks. It ends up being impactful in situations where a system queues Environment.ProcessorCount work items and expects that they will all be able to run concurrently (PLINQ does this for some queries, for example, where the work items it queues all need to join with each other at a barrier). While other work being executed can impact this, at least on an unloaded system, if the min thread pool count >= ProcessorCount, the work items should all be able to start and run concurrently quickly. If the min thread pool count < ProcessorCount, then we immediately hit a starvation situation, and it could take quite some time for the pool to ramp up to the required number of threads.

The …

That's fine.

It is not uncommon to run a lot of containers with a CPU quota less than 1 on beefy servers with many cores. The benchmark results posted here don't include that scenario. I think we may be reducing performance in that case.
@tmds following are results running the ASP.NET Core Benchmarks
We can see there is no decrease in performance with the changes, and even see an increase in performance with https://github.com/dotnet/coreclr/issues/22302#issuecomment-477330281 and #23398 (comment)

@luhenry thanks for doing that benchmark. Unfortunately, I don't feel confident myself about this change yet. It's not clear to me what is now taking into account …
```diff
@@ -148,7 +148,8 @@ class SimpleRWLock
     } CONTRACTL_END;

     m_RWLock = 0;
+    m_spinCount = (GetCurrentProcessCpuCount() == 1) ? 0 : 4000;
```
// Passing false here reduces ASP.NET Core Plaintext benchmark results from 1.2M to 0.8M RPS.
Just curious - where (stacktrace) does the ASP.NET Core Plaintext take this lock?
This focuses on better supporting Docker CLI's parameter `--cpus`, which limits the amount of CPU time available to the container (ex: 1.8 means 180% CPU time, i.e. on 2 cores 90% for each core, on 4 cores 45% on each core, etc.)

All the runtime components depending on the number of processors available are:
- ThreadPool
- GC
- `Environment.ProcessorCount` via `SystemNative::GetProcessorCount`
- `SimpleRWLock::m_spinCount`
- `BaseDomain::m_iNumberOfProcessors` (it's used to determine the GC heap to affinitize to)

All the above components take advantage of `--cpus` via `CGroup::GetCpuLimit` with dotnet#12797, allowing performance to be optimized in a container/machine with limited resources. This makes sure the runtime components make the best use of available resources.

In the case of `Environment.ProcessorCount`, the behavior is such that passing `--cpus=1.5` on a machine with 8 processors will return `1`, as shown in https://github.com/dotnet/coreclr/issues/22302#issuecomment-459092299. This behavior is not consistent with [Windows Job Objects](https://docs.microsoft.com/en-us/windows/desktop/api/winnt/ns-winnt-jobobject_cpu_rate_control_information), which still returns the number of processors for the container/machine even if it only gets part of the total number of cycles.

This behavior is erroneous because the container still has access to the full range of processors on the machine, and only its _processor time_ is limited. For example, in the case of a 4 processors machine, with a value of `--cpus=1.8`, there can be 4 threads running in parallel even though each thread will only get `1.8 / 4 = .45` or 45% of all cycles of each processor.

The work consists of reverting the behavior of `SystemNative::GetProcessorCount` to pre dotnet#12797.