
improve detection of CPU limits when running inside container #11933

Closed
wfurt opened this issue Jan 30, 2019 · 42 comments · Fixed by dotnet/coreclr#23413

@wfurt
Member

wfurt commented Jan 30, 2019

With dotnet/corefx#25193 in place, Environment.ProcessorCount can now reflect limits imposed via Docker. When running within docker run --cpus=2 -ti microsoft/dotnet-buildtools-prereqs:rhel7_prereqs_2 /bin/bash, this call returns 2 as expected on an 8-core host.

However, when limits are enforced in a different way, it fails to detect them. Say I limit the container to only the first two cores:

docker run --cpuset-cpus=0,1 -ti microsoft/dotnet-buildtools-prereqs:rhel7_prereqs_2 /bin/bash

[root@c39dcf61cf3a tests]# nproc
2

This shows that the container is limited to two cores.
But Environment.ProcessorCount returns 8.

related to https://github.com/dotnet/corefx/issues/34920

Value obtained via sched_getaffinity() is also 2.

#define _GNU_SOURCE 1
#include <stdio.h>
#include <sched.h>
#include <unistd.h> /* getpid */

int main(int argc, char **argv) {
    cpu_set_t set;
    if (sched_getaffinity(getpid(), sizeof(set), &set) == 0) {
        printf("count=%d\n", CPU_COUNT(&set));
    } else {
        printf("FAILED!\n");
    }
    return 0;
}
[root@c39dcf61cf3a tmp]# ./count
count=2

Note that sched_getaffinity() does not help in the first case, when limits are enforced by limiting cycles.

Perhaps we need to do both and return the lower value.

cc: @janvorli

@wfurt
Member Author

wfurt commented Jan 30, 2019

One more note: --cpus can be passed as a fraction of cores:

docker run --cpus=1.5 microsoft/dotnet-buildtools-prereqs:rhel7_prereqs_2  "/bin/bash"

In this case Environment.ProcessorCount returns 1.

@luhenry
Contributor

luhenry commented Mar 20, 2019

The two Docker CLI options that affect this issue are:

  1. --cpus: limits the amount of CPU time available to the container (e.g. 1.8 means 180% CPU time, i.e. on 2 cores 90% of each core, on 4 cores 45% of each core, etc.)
  2. --cpuset-cpus: limits the number of CPUs the container has access to; it also specifies which specific processors are available, but that's irrelevant here

All the runtime components depending on the number of processors available are:

  • ThreadPool
  • GC
  • Environment.ProcessorCount via SystemNative::GetProcessorCount
  • SimpleRWLock::m_spinCount
  • BaseDomain::m_iNumberOfProcessors (it's used to determine the GC heap to affinitize to)

All of the above components except Environment.ProcessorCount are aware of, and take advantage of, the values passed to --cpus and --cpuset-cpus.

--cpus

dotnet/coreclr#12797 has already been done. It impacts all of the above runtime components, allowing performance to be optimized in a container/machine with limited resources, and makes sure the runtime components make the best use of what is available.

In the case of Environment.ProcessorCount, the behavior is such that passing --cpus=1.5 on a machine with 8 processors will return 1 as shown in https://github.com/dotnet/coreclr/issues/22302#issuecomment-459092299. This behavior is not consistent with Windows Job Objects which still returns the number of processors for the container/machine even if it only gets parts of the total number of cycles.

I would argue that in the case of Environment.ProcessorCount, we would want to return the actual number of processors and not the rounded-down value of --cpus. It is currently an overloaded value that can mean more than one thing and is inconsistent between platforms.

--cpuset-cpus

The work has been done here for all runtime components except Environment.ProcessorCount. The remaining work consists of fixing any of SystemNative::GetProcessorCount, CPUGroupInfo::InitCPUGroupInfoArray, or GetLogicalProcessorInformationEx to use sched_getaffinity.

@luhenry
Contributor

luhenry commented Mar 20, 2019

To complement Environment.ProcessorCount returning the actual number of processors (i.e. reverting to the behavior before dotnet/coreclr#12797), we could add an Environment.ProcessorQuota returning a float between 0 and 1 that tells the user how much of the per-processor time the process can use. This would keep Environment.ProcessorCount consistent across Windows Job Objects and containers, and would not change its behavior with or without --cpus, while still letting performance-minded users of Environment.ProcessorCount minimize context-switch costs.

@VSadov
Member

VSadov commented Mar 20, 2019

Just to clarify: does --cpus have parallelism limiting effect at all?

From reading the docs, it seems it is just a CPU quota weighted by the total number of cores.
I.e. --cpus=1 on a 4-core machine means that your time slices are throttled to 25% of the total. It does not seem to imply that concurrency is limited in any way.

Basically you may still utilize 4 threads, but will observe that each runs at roughly 1/4 of the normal speed.

Is that the right understanding?

@luhenry
Contributor

luhenry commented Mar 20, 2019

Exactly, --cpus does not reduce the number of processors available; it limits the total amount of time available to the container.

@janvorli
Member

Ah, ok, I didn't know that. I was assuming that it affinitizes the process to the minimum number of processors needed plus adds some throttling. Since that is not the case, it seems we should ignore --cpus for the purpose of getting the CPU count.

@VSadov
Member

VSadov commented Mar 20, 2019

Then --cpus should not have effect on Environment.ProcessorCount.
It makes a container slower, not less parallel.

Even with reduced quota you may want to use threads, if that makes you more efficient or more responsive.

On the other hand --cpuset-cpus specifically reduces the parallelism level.
For example you may not want to exceed the total number when spawning worker threads.
I think --cpuset-cpus should be reflected in Environment.ProcessorCount.

@Maoni0
Member

Maoni0 commented Mar 20, 2019

from the GC's POV, if --cpus is specified to use only M cores out of N, we would want to create only M heaps (not affinitized to any specific CPUs).

@wfurt
Member Author

wfurt commented Mar 20, 2019

--cpus does not need to be an integer and therefore has no direct implication for cores, @Maoni0. I think the explanation of slowing down to an N-core equivalent is a good one.

@luhenry
Contributor

luhenry commented Mar 20, 2019

Running locally, I get the following:

$> docker run -it --cpus=1.5 microsoft/dotnet-buildtools-prereqs:rhel7_prereqs_2  "/bin/bash"
[root@71affec82676 /]# nproc
8
$ docker run -it --cpuset-cpus=0-2,5 microsoft/dotnet-buildtools-prereqs:rhel7_prereqs_2  "/bin/bash"
[root@84082c18cab1 /]# nproc
4

To summarize, I'll do the following:

  • for Environment.ProcessorCount, I'll revert to returning the actual number of processors even when passing --cpus, and I'll fix the behavior when passing --cpuset-cpus to return the number of processors allotted to the container (this should match the value returned by nproc).
  • for all other use cases (ThreadPool, GC, etc.), we'll keep the current behavior of scaling down to the rounded-down value of --cpus, so we make sure to keep optimal performance by minimizing the cost of context switches (with less time per processor for each thread, the relative cost of context switches goes up).

@luhenry
Contributor

luhenry commented Mar 20, 2019

And all uses of Environment.ProcessorCount are as follows: https://source.dot.net/#System.Private.CoreLib/src/System/Environment.CoreCLR.cs,e5c0f3a0c450c2f3,references

@stephentoub what are your thoughts on adding an Environment.ProcessorQuota to allow performance-minded users to optimize for the same cases as the runtime?

@VSadov
Member

VSadov commented Mar 20, 2019

Having more information is generally better, but it is not easy to see how Environment.ProcessorQuota could be used.

Perhaps SpinWait and similar could use it to reduce spinning vs. sleeping ... since we effectively get a slower CPU?

@wfurt
Member Author

wfurt commented Mar 20, 2019

@luhenry
Contributor

luhenry commented Mar 20, 2019

@VSadov it could be used mostly to more accurately estimate the optimal number of threads, the same way it's used by the GC and the ThreadPool to better use available resources.

@stephentoub
Member

it could be used mostly to more accurately estimate the optimal number of threads,

Could you pick a few of the current uses of ProcessorCount and show how the quota would be used and what the benefit would be?

@janvorli
Member

--cpuset-cpus

The work has been done here for all runtime components except Environment.ProcessorCount. The work would consist in fixing any of SystemNative::GetProcessorCount, CPUGroupInfo::InitCPUGroupInfoArray or GetLogicalProcessorInformationEx to use sched_getaffinity.

There are actually two code paths in SystemNative::GetProcessorCount. One is for the case when NUMA support is enabled in the runtime (CPUGroupInfo::CanEnableThreadUseAllCpuGroups() returns TRUE) and one for the other case. When NUMA is not enabled, we use the value returned from GetSystemInfo in dwNumberOfProcessors. That value will also need to be changed so it is correctly influenced by --cpuset-cpus. Right now, the value is what we get from sysconf(_SC_NPROCESSORS_ONLN), and that value is not influenced by --cpuset-cpus.

@luhenry
Contributor

luhenry commented Mar 20, 2019

@janvorli I am updating PAL_GetLogicalCpuCountFromOS which is used both by GetSystemInfo (for the non-NUMA case) and by NUMASupportInitialize (for the case where NUMA is not available).

This raises the question of what should be returned when NUMA is enabled and available; I do not know, and I would love to better understand how NUMA fits into this project.

@janvorli
Member

As for NUMA, it seems we could prune the CPUs reported in the bitmasks by libnuma using the thread affinity mask before we parse those bitmasks. I believe it would work fine. I say "I believe" since we cannot match it to Windows behavior: Windows doesn't seem to have a way to limit a process to run on only a subset of CPUs when the process uses SetThreadIdealProcessorEx.

@tmds
Member

tmds commented Mar 21, 2019

I am not a container platform expert, but I occasionally do some things with OpenShift/Kubernetes, and I have only seen settings that control the CPU share (like Docker's --cpus). I haven't seen settings that restrict the number of CPUs (cf. Docker's --cpuset-cpus).

Since there is no Environment.ProcessorQuota, any existing software uses Environment.ProcessorCount to scale depending on CPU.

So for a container platform user, it may be preferable to stick to the current implementation.

luhenry referenced this issue in luhenry/coreclr Mar 21, 2019
There are 2 Docker CLI command line options available that we are interested in here:
 - `--cpus`: limits the amount of CPU time available to the container (e.g. 1.8 means 180% CPU time, i.e. on 2 cores 90% of each core, on 4 cores 45% of each core, etc.)
 - `--cpuset-cpus`: limits the number of processors the container has access to; it also specifies which specific processors are available, but that's irrelevant here

All the runtime components depending on the number of processors available are:
 - ThreadPool
 - GC
 - `Environment.ProcessorCount` via `SystemNative::GetProcessorCount`
 - `SimpleRWLock::m_spinCount`
 - `BaseDomain::m_iNumberOfProcessors` (it's used to determine the GC heap to affinitize to)

All of the above components except `Environment.ProcessorCount` are aware of, and take advantage of, the values passed to `--cpus` and `--cpuset-cpus`.

**`--cpus`**

dotnet#12797 has already been done. It impacts all of the above runtime components, allowing performance to be optimized in a container/machine with limited resources, and makes sure the runtime components make the best use of what is available.

In the case of `Environment.ProcessorCount`, the behavior is such that passing `--cpus=1.5` on a machine with 8 processors will return `1`  as shown in https://github.com/dotnet/coreclr/issues/22302#issuecomment-459092299. This behavior is not consistent with [Windows Job Objects](https://docs.microsoft.com/en-us/windows/desktop/api/winnt/ns-winnt-jobobject_cpu_rate_control_information) which still returns the number of processors for the container/machine even if it only gets parts of the total number of cycles.

This behavior is erroneous because the container still has access to the full range of processors on the machine, and only its _processor time_ is limited. For example, in the case of a 4-processor machine with a value of `--cpus=1.8`, there can be 4 threads running in parallel even though each thread will only get `1.8 / 4 = 0.45` or 45% of the cycles of each processor.

The work consists of reverting the behavior of `SystemNative::GetProcessorCount` to pre dotnet#12797.

**`--cpuset-cpus`**

The work has been done here for all runtime components except `Environment.ProcessorCount`. The work consists of fixing `PAL_GetLogicalCpuCountFromOS` to use `sched_getaffinity`.

Fixes https://github.com/dotnet/coreclr/issues/22302
@luhenry
Contributor

luhenry commented Mar 22, 2019

luhenry referenced this issue in luhenry/coreclr Mar 22, 2019
This focuses on better supporting `--cpuset-cpus`, which limits the set of processors the container has access to; it also specifies which specific processors those are, but that's irrelevant here.

The work has been done here for all runtime components except `Environment.ProcessorCount`. The work consists of fixing `PAL_GetLogicalCpuCountFromOS` to use `sched_getaffinity`.

Fixes https://github.com/dotnet/coreclr/issues/22302
luhenry referenced this issue in luhenry/coreclr Mar 22, 2019
@jkotas
Member

jkotas commented Mar 22, 2019

dotnet/[email protected]:env-processorquota

Does this change actually make things always better?

We said in the earlier discussion that the processor quota really gives you a slower processor, but it does not reduce the level of parallelism available...

I think the underlying problem all this is trying to solve is: what is the right size of a pool or cache for a given component? The processor count works well for this on balanced machines that have a reasonable amount of memory per processor. Docker makes it easy to create environments that are imbalanced: a lot of (potential) processor power and not enough memory, or vice versa. Simplistic algorithms for computing the size of a pool or cache work poorly in these conditions. Maybe what we need here is an API specifically designed for building pools and caches that auto-adjust their sizes based on many factors.

@luhenry
Contributor

luhenry commented Mar 22, 2019

@jkotas it doesn't necessarily give you slower processors, but it reduces the budget you have across all the processors on the machine. I posted an example at dotnet/coreclr#23398 (comment) that illustrates some other cases, like spiky workloads.

A good example of a workload that benefits from this quota is a Parallel.ForEach that does not access any shared state and operates purely on its input (e.g. the map step of a map-reduce algorithm). In this case, we do not want to overcommit on the number of threads, as that only degrades performance. This goes hand-in-hand with dotnet/coreclr#23398; if we don't go with that PR, then Environment.ProcessorQuota is not required.

If we do not go with dotnet/coreclr#23398, we then have to figure out the best value to return in cases that could be rounded up. For example, should --cpus=1.5 round up to 2 processors (with 75% of each if 2 threads are very CPU intensive), or stay at 1 processor and expose ourselves to underutilization? I'll run some benchmarks with ASP.NET to figure out some high-level answers to that.

@VSadov
Member

VSadov commented Mar 22, 2019

Parallel.ForEach could be a good example.

Note that ThreadPool is not. One of the TP metrics is latency - how quickly it may dispatch bursts of tasks. Going to #procs will still be advantageous for that even if under a CPU quota.

Parallel.ForEach on the other hand may be assumed to care only about throughput, so knowing quota could be helpful for more optimal partitioning.

@janvorli
Member

For this discussion, I think it is also good to get a full understanding of how scheduling in the container works. There are a couple of docs I've found:

The kubernetes doc that @tmds has provided (https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/) says

spec.containers[].resources.limits.cpu is converted to its millicore value and multiplied by 100. The resulting value is the total amount of CPU time that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.

The cgroups doc at https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu#sect-cfs seems to be more specific on the scheduling behavior:

cpu.cfs_quota_us
specifies the total amount of time in microseconds (µs, represented here as "us") for which all tasks in a cgroup can run during one period (as defined by cpu.cfs_period_us). As soon as tasks in a cgroup use up all the time specified by the quota, they are throttled for the remainder of the time specified by the period and not allowed to run until the next period.

Even more detailed description of the CFS scheduling can be found here:
https://github.com/torvalds/linux/blob/master/Documentation/scheduler/sched-design-CFS.txt

@luhenry luhenry reopened this Mar 25, 2019
@luhenry
Contributor

luhenry commented Mar 25, 2019

I ran some ASP.NET Core benchmarks with --cpuset-cpus=1,2 and --cpus=1.999999999 or --cpus=2. I got the following results with the plaintext benchmark on a VM on Azure:

  1. with --cpuset-cpus=1,2 and --cpus=2

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 600,472 | 100 | 4.11 | 211 |
     | 601,953 | 100 | 4.11 | 279 |
     | 607,330 | 100 | 4.17 | 308 |
     | 603,674 | 100 | 4.27 | 216 |
     | 603,054 | 99 | 4.05 | 207 |

  2. with --cpuset-cpus=1,2 and --cpus=1.999999999

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 399,743 | 169 | 6.17 | 292 |
     | 405,412 | 160 | 6.2 | 236 |
     | 408,685 | 173 | 6.24 | 260 |
     | 492,366 | 196 | 5.07 | 233 |
     | 414,865 | 177 | 6.07 | 323 |

  3. with --cpuset-cpus=1,2 and --cpus=1.7

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 431,411 | 170 | 6.2 | 280 |
     | 418,328 | 156 | 6.1 | 322 |
     | 407,473 | 153 | 6.25 | 276 |
     | 433,062 | 156 | 5.89 | 210 |
     | 428,323 | 163 | 6.01 | 306 |

We can see a clear drop in performance between passing --cpus=1.999999999 and --cpus=2, while the performance of --cpus=1.999999999 and --cpus=1.7 is similar even though the total CPU budget was reduced by 15%.

My hypothesis is that the difference between 1.999999999 and 2 is negligible from the OS/scheduler perspective, and that the observed difference is explained by the different choices made by the runtime in terms of ThreadPool, GC, and others. To verify that hypothesis, I will run two locally compiled versions of coreclr, with and without dotnet/coreclr#23398, with --cpus=1.999999999. I'll post the results here.

[1] the Max CPU (%) should be between 0-100%, but in 2. and 3. it's over 100%. That's because the total time spent on CPU is divided by Environment.ProcessorCount, which in .NET Core 3.0 returns the rounded-down value passed to --cpus.

@tmds
Member

tmds commented Mar 25, 2019

@luhenry rounding up makes sense. If your container has 1.7 CPUs, it will run on more than 1 core. Your benchmark results confirm this.

luhenry referenced this issue in luhenry/coreclr Mar 27, 2019
luhenry referenced this issue in luhenry/coreclr Mar 27, 2019
@luhenry
Contributor

luhenry commented Mar 27, 2019

dotnet/coreclr#23398 fixes the situation for the ThreadPool and the GC when passing --cpus=1.999999999. It now simply rounds to the nearest integer.

I am now testing --cpus=1.4999999999 and --cpus=1.5 to verify there is no performance cliff between the two. If we observe a less-than-ideal drop, we should explore rounding up instead of rounding to the nearest integer.

@luhenry
Contributor

luhenry commented Mar 27, 2019

  1. with --cpuset-cpus=1,2 --cpus=1.5 + rounding to nearest integer

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 469,780 | 151 | 7.69 | 288 |
     | 479,875 | 151 | 7.47 | 284 |
     | 468,337 | 151 | 7.63 | 270 |
     | 479,624 | 151 | 7.58 | 203 |
     | 482,347 | 151 | 7.52 | 259 |

  2. with --cpuset-cpus=1,2 --cpus=1.499999999 + rounding to nearest integer

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 431,592 | 148 | 6.53 | 265 |
     | 419,993 | 145 | 6.29 | 237 |
     | 405,770 | 143 | 6.62 | 248 |
     | 414,939 | 140 | 6.26 | 257 |
     | 414,848 | 148 | 6.61 | 285 |

  3. with --cpuset-cpus=1,2 --cpus=1.499999999 + rounding up

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 417,095 | 150 | 6.62 | 191 |
     | 428,464 | 141 | 6.48 | 246 |
     | 428,544 | 147 | 6.35 | 196 |
     | 423,698 | 148 | 6.41 | 272 |
     | 436,548 | 145 | 6.35 | 220 |

  4. with --cpuset-cpus=1,2 --cpus=1.000000001 + rounding up

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 320,702 | 101 | 11.69 | 288 |
     | 310,606 | 100 | 12.91 | 262 |
     | 317,907 | 101 | 13.56 | 309 |
     | 316,390 | 102 | 12.88 | 195 |
     | 319,566 | 104 | 12.09 | 272 |

  5. with --cpuset-cpus=1,2 --cpus=1 + rounding up

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 316,496 | 102 | 12.63 | 282 |
     | 316,616 | 103 | 13.19 | 411 |
     | 317,123 | 102 | 12.57 | 208 |
     | 315,176 | 101 | 13.5 | 203 |
     | 317,981 | 102 | 12.7 | 261 |

We can observe in the --cpus=1.499999999 case that rounding up gives slightly better results, and that --cpus=1.000000001 and --cpus=1 are equivalent in terms of performance.

Based on that, I propose to go with rounding up in order to maximize the use of the available CPU.

@janvorli @jkotas @sergiy-k

luhenry referenced this issue in luhenry/coreclr Mar 27, 2019
@VSadov
Member

VSadov commented Mar 27, 2019

In the past when I needed to measure changes in ThreadPool latency, I was pointed to the following snippet:
dotnet/coreclr#13670 (comment)

Not sure if the benchmark exists in some other form.
That could be used to measure TP performance under various conditions, just run with:

    TaskBurstWorkThroughput 1PcT 000.5PcWi
    TaskBurstWorkThroughput 1PcT 001.0PcWi
    TaskBurstWorkThroughput 1PcT 004.0PcWi
    TaskBurstWorkThroughput 1PcT 016.0PcWi
    TaskBurstWorkThroughput 1PcT 064.0PcWi
    TaskBurstWorkThroughput 1PcT 256.0PcWi
    TaskSustainedWorkThroughput 1PcT
    ThreadPoolBurstWorkThroughput 1PcT 000.5PcWi
    ThreadPoolBurstWorkThroughput 1PcT 001.0PcWi
    ThreadPoolBurstWorkThroughput 1PcT 004.0PcWi
    ThreadPoolBurstWorkThroughput 1PcT 016.0PcWi
    ThreadPoolBurstWorkThroughput 1PcT 064.0PcWi
    ThreadPoolBurstWorkThroughput 1PcT 256.0PcWi
    ThreadPoolSustainedWorkThroughput 1PcT

I think that the initial worker limit should not change and should just be the physical number of cores (i.e. the level of parallelism available).
However, the benchmark may show something different...

Where CPU quota could be more interesting is in GetCPUBusyTime_NT.

That is used to detect situations where CPU is lightly loaded and we are not making progress with tasks (which together indicates that workers are blocked), then we allow more workers as a progress guarantee/deadlock prevention measure.

The part where we reason about "CPU is lightly loaded" could give a biased reading when quotas are involved, since it currently expects that all cores in the affinity mask can do 100%.
Under a quota we may misdetect a "lightly loaded" situation and inject workers when no more CPU power is actually available.

CC: @kouvel

@VSadov
Member

VSadov commented Mar 27, 2019

Note - there is also PAL_GetCPUBusyTime - the non-NT version, which is perhaps more interesting in a container.

luhenry referenced this issue in luhenry/coreclr Mar 28, 2019
This focuses on better supporting Docker CLI's parameter `--cpus`, which limits the amount of CPU time available to the container (ex: 1.8 means 180% CPU time, ie on 2 cores 90% for each core, on 4 cores 45% on each core, etc.)

All the runtime components depending on the number of processors available are:
 - ThreadPool
 - GC
 - `Environment.ProcessorCount` via `SystemNative::GetProcessorCount`
 - `SimpleRWLock::m_spinCount`
 - `BaseDomain::m_iNumberOfProcessors` (it's used to determine the GC heap to affinitize to)

All the above components take advantage of `--cpus` via `CGroup::GetCpuLimit` with dotnet#12797, allowing the runtime to optimize performance in a container/machine with limited resources. This makes sure the runtime components make the best use of available resources.

In the case of `Environment.ProcessorCount`, the behavior is such that passing `--cpus=1.5` on a machine with 8 processors will return `1`, as shown in https://github.com/dotnet/coreclr/issues/22302#issuecomment-459092299. This behavior is not consistent with [Windows Job Objects](https://docs.microsoft.com/en-us/windows/desktop/api/winnt/ns-winnt-jobobject_cpu_rate_control_information), which still report the number of processors for the container/machine even if it only gets part of the total number of cycles.

This behavior is erroneous because the container still has access to the full range of processors on the machine; only its _processor time_ is limited. For example, in the case of a 4-processor machine with a value of `--cpus=1.8`, there can be 4 threads running in parallel even though each thread will only get `1.8 / 4 = 0.45`, or 45%, of each processor's cycles.

The work consists of reverting the behavior of `SystemNative::GetProcessorCount` to pre-dotnet#12797.
luhenry referenced this issue in luhenry/coreclr Mar 29, 2019
luhenry referenced this issue in luhenry/coreclr Apr 2, 2019
luhenry referenced this issue in luhenry/coreclr Apr 5, 2019
@luhenry luhenry closed this as completed Apr 12, 2019
@mrmartan

mrmartan commented Jul 25, 2019

As @janvorli said:

For this discussion, I think it is also good to get full understanding on how the scheduling in the container works.

Whatever you did is IMHO not sufficient.
Your discussion here turns around two docker arguments. It in fact runs deeper, to the way those settings are propagated to the OS kernel. And there are more docker options than those two (and I suppose docker can change the way they map those onto the OS). I know Kubernetes uses different docker options than those discussed here as well. See: https://medium.com/@betz.mark/understanding-resource-limits-in-kubernetes-cpu-time-9eff74d3161b

I could give you more detailed info if you'd like. First, a description of our use case: ASP.NET Core APIs running in Kubernetes.

tl;dr The way .NET currently works, we are forced to run our docker containers without specifying any limits.

We are running one of our APIs in 14 instances (docker containers spread across our Kubernetes cluster; the workers are CentOS based). The base image used is mcr.microsoft.com/dotnet/core/aspnet:2.2.6. When no CPU limits are specified it serves around 1k HTTP requests per second, consuming circa 350m of CPU per instance (kube pod). The total number of processors reported by .NET is 224 (16 per instance, since each Kubernetes worker is a VM with 16 logical CPUs). Setting the kube CPU limit anywhere near our real usage causes .NET to report 14 CPUs (one per instance, i.e. each dotnet process behaves as if running on a single-core machine) and makes our system unable to handle its load.

To cope we have to increase scaling about three times (to 42 instances), which obviously consumes way more resources than required (memory especially). Setting the limit higher than necessary does not protect us from overconsuming resources while still limiting concurrency (whatever having e.g. Environment.ProcessorCount == 2 means for concurrency, since I assume that from the OS perspective the dotnet process threads are not limited to 2 cores; as far as I can tell Kubernetes uses neither the --cpuset-cpus docker option nor --cpus).

I have also tried .NET Core 3 Preview 7 and it behaves the same on Kubernetes as 2.2.6.

@janvorli
Member

While the discussion was about two docker arguments, the coreclr runtime reads the settings from cgroups (cpu.cfs_quota_us and cpu.cfs_period_us) and from the process affinity, not from docker. So it is independent of how the limits were specified.
There was a discussion above on what to report in Environment.ProcessorCount on a system that has a limit specified by these cgroups settings: whether to report the number of processor cores the .NET process can run on, or cpu.cfs_quota_us / cpu.cfs_period_us. The final consensus, also based on performance results of some asp.net benchmarks and other measurements, was to make it report cpu.cfs_quota_us / cpu.cfs_period_us.

I would like to understand why setting the CPU limit near your real CPU usage causes the performance issues. I have a couple of questions:

  • Does your app explicitly make decisions based on Environment.ProcessorCount or it is just due to what ASP.NET does?
  • What are the CPU limits you were setting?
  • How large is the perf degradation you are seeing?
  • What happens perf-wise if you double the CPU limits?

@mrmartan

mrmartan commented Jul 29, 2019

Does your app explicitly make decisions based on Environment.ProcessorCount or it is just due to what ASP.NET does?

No, it does not.

What are the CPU limits you were setting?

600m

How large is the perf degradation you are seeing?

We have to increase scaling about 3 times to handle the same load. I can't experiment with our Production environment as much as I would like :)

I should probably add that the app is heavy on I/O, executing dozens of async network I/O operations per incoming request.

What happens perf-wise if you double the CPU limits?

That's what I intend to test. To set the limits so that Environment.ProcessorCount > 1. Mainly because I read that when Environment.ProcessorCount == 1 CoreCLR forces GC to run in workstation mode.

@mrmartan

@janvorli I did not mean to suggest that this is implemented through some docker integration. I can see that it is done by means of kernel cgroups/CFS. I can also see from the discussion that it is a fairly complex issue.

Still, setting a CPU quota is not intended to limit concurrency (at least not in Kubernetes), and the way CoreCLR tunes itself is not appropriate in all use cases.
I have found a similar issue in Java; Java 10 uses the same approach as CoreCLR here. They did expose a knob to let the user override the available CPU count. So unless Environment.ProcessorCount changes to take its value from the cgroups cpuset (and I don't suppose it will), it would be useful for the developer to be able to override Environment.ProcessorCount (or whatever is appropriate for .NET, as not everything CPU related is driven by Environment.ProcessorCount).

@mrmartan

mrmartan commented Aug 7, 2019

Just to follow up on this if anyone would feel like addressing it. I feel like the current state of .NET is not suitable for Kubernetes deployments.

Following is the result of changing kube pod CPU limit from unlimited (on 16 CPU kube workers) to 2000m (and only that change):

Average CPU consumption is 280m, which is down from the previous 350m.
Memory consumption dropped by two thirds.
Surprisingly, even our response time 90th percentile dropped by about 5%.

[image: resource consumption graphs]

The way .NET works forces us to leave the Kubernetes CPU limit at 2000m even though we do not require that much for this use case.

@jkotas
Member

jkotas commented Aug 7, 2019

@mrmartan Could you please create a new github issue describing the problem? Comments on closed PRs are unlikely to get the response from the right people.

@VSadov
Member

VSadov commented Aug 7, 2019

@mrmartan - I have created an issue https://github.com/dotnet/coreclr/issues/26053 to my best understanding of your problem.
Please take a look and comment on whether that does/does not cover your case.

@tmds
Member

tmds commented Aug 8, 2019

It would be good to do some perf tracing to understand why containers with low CPU allocations are not performing as well as expected.

Such low allocations are common on Kubernetes, so it may be worth doing continuous perf benchmarks.

cc @davidfowl @sebastienros @adamsitnik

@mrmartan

mrmartan commented Aug 8, 2019

I now see two issues here: dotnet/coreclr#26053, which @VSadov opened (I will try and add some details to it), and possibly the one @tmds is hinting at, i.e. whether low performance is expected with a low CPU allocation (single core; EDIT: or rather perceived by .NET as single core).

I have a screenshot from a test I performed on our test Kubernetes cluster (mostly the same configuration as discussed above). In the test there was only one replica of the app in question. With the kube CPU limit set to 700m the actual CPU consumption was about 700m, and the result is on the left side of the image. With the kube CPU limit at 2000m the actual CPU consumption was 750m, and the result is on the right side of the image.

The image is from SuperBenchmarker (unfortunately I cropped it without the scale ☹️ )
[image: SuperBenchmarker latency/RPS graph]
Pink is latency, green is RPS (requests per second; maximum was 650 RPS)

@tmds
Member

tmds commented Dec 10, 2019

In the test there was only one replica of the app in question. With kube CPU limit set to 700m the actual CPU consumption was about 700m and the result is on the left side of the image. With kube CPU limit at 2000m the actual CPU consumption was 750m and the result is on the right side of the image.

These numbers are hard to compare.
With kube CPU limit at 700m, the app is restricted by CPU.
With kube CPU limit at 2000m, the app is no longer restricted by CPU.

@mrmartan

I see your point. I no longer have the setup to test it again and post the results here, but at the time I could have given it 1000m and the result would have been the same (while the actual CPU consumption would not rise).

The case is that the application was not CPU limited. CoreCLR was crippled since it thought it was running on a single-core machine. On K8s/Linux a 700m CPU limit does not imply 1 CPU/thread, but CoreCLR behaved as though it did.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020