
improve detection of CPU limits when running inside container #11933

Closed
wfurt opened this issue Jan 30, 2019 · 42 comments · Fixed by dotnet/coreclr#23413

@wfurt
Member

wfurt commented Jan 30, 2019

With dotnet/corefx#25193 in place, Environment.ProcessorCount can now reflect limits imposed via Docker. When running within docker run --cpus=2 -ti microsoft/dotnet-buildtools-prereqs:rhel7_prereqs_2 /bin/bash, this call returns 2 as expected on an 8-core host.

However, when limits are enforced in a different way, it fails to detect them. Say I limit the container to only the first two cores:

docker run --cpuset-cpus=0,1 -ti microsoft/dotnet-buildtools-prereqs:rhel7_prereqs_2 /bin/bash

[root@c39dcf61cf3a tests]# nproc
2

This shows that the container is limited to two cores.
But Environment.ProcessorCount returns 8.

related to https://github.com/dotnet/corefx/issues/34920

Value obtained via sched_getaffinity() is also 2.

#define _GNU_SOURCE 1
#include <stdio.h>
#include <sched.h>
#include <unistd.h> /* getpid */

int main(int argc, char **argv) {
    cpu_set_t set;
    if (sched_getaffinity(getpid(), sizeof(set), &set) == 0) {
        printf("count=%d\n", CPU_COUNT(&set));
    } else {
        printf("FAILED!\n");
    }
    return 0;
}
[root@c39dcf61cf3a tmp]# ./count
count=2

Note that sched_getaffinity() does not help in the first case, when limits are enforced by limiting cycles.

Perhaps we need to do both and return the lower value.

cc: @janvorli

@wfurt
Member Author

wfurt commented Jan 30, 2019

One more note: --cpus can be passed as a fraction of cores:

docker run --cpus=1.5 microsoft/dotnet-buildtools-prereqs:rhel7_prereqs_2  "/bin/bash"

In this case Environment.ProcessorCount returns 1.

@luhenry
Contributor

luhenry commented Mar 20, 2019

The two Docker CLI options that affect this issue are:

  1. --cpus: limits the amount of CPU time available to the container (e.g. 1.8 means 180% CPU time, i.e. on 2 cores 90% of each core, on 4 cores 45% of each core, etc.)
  2. --cpuset-cpus: limits the number of CPUs the container has access to; it also specifies which specific processors are available, but that's irrelevant here

All the runtime components depending on the number of processors available are:

  • ThreadPool
  • GC
  • Environment.ProcessorCount via SystemNative::GetProcessorCount
  • SimpleRWLock::m_spinCount
  • BaseDomain::m_iNumberOfProcessors (it's used to determine the GC heap to affinitize to)

All of the above components except Environment.ProcessorCount are aware of, and take advantage of, the values passed to --cpus and --cpuset-cpus.

--cpus

dotnet/coreclr#12797 has already been done. It impacts all of the above runtime components, allowing performance to be optimized in a container/machine with limited resources, and makes sure the runtime components make the best use of what is available.

In the case of Environment.ProcessorCount, the behavior is such that passing --cpus=1.5 on a machine with 8 processors will return 1 as shown in https://github.com/dotnet/coreclr/issues/22302#issuecomment-459092299. This behavior is not consistent with Windows Job Objects which still returns the number of processors for the container/machine even if it only gets parts of the total number of cycles.

I would argue that in the case of Environment.ProcessorCount, we would want to return the actual number of processors and not the rounded-down value of --cpus. It is currently an overloaded value that can mean more than one thing and is inconsistent between platforms.

--cpuset-cpus

The work has been done here for all runtime components except Environment.ProcessorCount. The remaining work consists of fixing any of SystemNative::GetProcessorCount, CPUGroupInfo::InitCPUGroupInfoArray, or GetLogicalProcessorInformationEx to use sched_getaffinity.

@luhenry
Contributor

luhenry commented Mar 20, 2019

To complement Environment.ProcessorCount returning the actual number of processors (i.e. reverting to the behavior before dotnet/coreclr#12797), we could add an Environment.ProcessorQuota returning a float between 0 and 1 that tells the user how much of the per-processor time the process can use. This would keep Environment.ProcessorCount consistent across Windows Job Objects and containers, and would not change its behavior with or without --cpus, while still letting performance-minded users of Environment.ProcessorCount minimize context-switch costs.

@VSadov
Member

VSadov commented Mar 20, 2019

Just to clarify: does --cpus have parallelism limiting effect at all?

From reading the docs, it seems it is just a CPU quota weighted by the total number of cores.
I.e. --cpus=1 on a 4-core machine means that your time slices are throttled to 25% of the total. It does not seem to imply that concurrency is limited in any way.

Basically you may still utilize 4 threads, but will observe that each runs at roughly 1/4 of the normal speed.

Is that the right understanding?

@luhenry
Contributor

luhenry commented Mar 20, 2019

Exactly, --cpus does not reduce the number of processors available; it limits the total amount of time available to the container.

@janvorli
Member

Ah, ok, I didn't know that. I was assuming that it affinitizes the process to the minimum number of processors needed plus adds some throttling. Since that is not the case, it seems we should ignore --cpus for the purpose of getting the CPU count.

@VSadov
Member

VSadov commented Mar 20, 2019

Then --cpus should not have effect on Environment.ProcessorCount.
It makes a container slower, not less parallel.

Even with reduced quota you may want to use threads, if that makes you more efficient or more responsive.

On the other hand --cpuset-cpus specifically reduces the parallelism level.
For example you may not want to exceed the total number when spawning worker threads.
I think --cpuset-cpus should be reflected in Environment.ProcessorCount.

@Maoni0
Member

Maoni0 commented Mar 20, 2019

from the GC's POV, if --cpus is specified to use only M cores out of N, we would want to create only M heaps (not affinitized to any specific CPUs).

@wfurt
Member Author

wfurt commented Mar 20, 2019

--cpus does not need to be an integer and therefore has no direct implication for cores, @Maoni0. I think the explanation of slowing down to an N-core equivalent is a good one.

@luhenry
Contributor

luhenry commented Mar 20, 2019

Running locally, I get the following:

$> docker run -it --cpus=1.5 microsoft/dotnet-buildtools-prereqs:rhel7_prereqs_2  "/bin/bash"
[root@71affec82676 /]# nproc
8
$ docker run -it --cpuset-cpus=0-2,5 microsoft/dotnet-buildtools-prereqs:rhel7_prereqs_2  "/bin/bash"
[root@84082c18cab1 /]# nproc
4

To summarize, I'll do the following:

  • for Environment.ProcessorCount, I'll revert to returning the actual number of processors even when passing --cpus, and I'll fix the behavior when passing --cpuset-cpus to return the number of processors allotted to the container (this should match the value returned by nproc).
  • for all other use cases (ThreadPool, GC, etc.), we'll keep the current behavior of scaling down to the rounded-down value of --cpus, so we make sure to keep optimal performance by minimizing the cost of context switches (with less time per processor for each thread, the relative cost of context switches goes up).

@luhenry
Contributor

luhenry commented Mar 20, 2019

And all uses of Environment.ProcessorCount are as follows: https://source.dot.net/#System.Private.CoreLib/src/System/Environment.CoreCLR.cs,e5c0f3a0c450c2f3,references

@stephentoub what are your thoughts on adding an Environment.ProcessorQuota to allow performance-minded users to optimize for the same cases as the runtime?

@VSadov
Member

VSadov commented Mar 20, 2019

Having more information is generally better, but it is not easy to see how Environment.ProcessorQuota could be used.

Perhaps SpinWait and similar could use it to reduce spinning vs. sleeping ... since we effectively get a slower CPU?

@wfurt
Member Author

wfurt commented Mar 20, 2019

@luhenry
Contributor

luhenry commented Mar 20, 2019

@VSadov it could be used mostly to more accurately estimate the optimal number of threads, the same way it's used by the GC and the ThreadPool to better use available resources.

@stephentoub
Member

it could be used mostly to more accurately estimate the optimal number of threads,

Could you pick a few of the current uses of ProcessorCount and show how the quota would be used and what the benefit would be?

@janvorli
Member

--cpuset-cpus

The work has been done here for all runtime components except Environment.ProcessorCount. The work would consist in fixing any of SystemNative::GetProcessorCount, CPUGroupInfo::InitCPUGroupInfoArray or GetLogicalProcessorInformationEx to use sched_getaffinity.

There are actually two code paths in SystemNative::GetProcessorCount. One is for the case when NUMA support is enabled in the runtime (CPUGroupInfo::CanEnableThreadUseAllCpuGroups() returns TRUE) and one for the other case. When NUMA is not enabled, we use the value returned from GetSystemInfo in dwNumberOfProcessors. That value will also need to be changed so it is correctly influenced by --cpuset-cpus. Right now, the value is what we get from sysconf(_SC_NPROCESSORS_ONLN), and that value is not influenced by --cpuset-cpus.

@luhenry
Contributor

luhenry commented Mar 20, 2019

@janvorli I am updating PAL_GetLogicalCpuCountFromOS which is used both by GetSystemInfo (for the non-NUMA case) and by NUMASupportInitialize (for the case where NUMA is not available).

This raises the question of what should be returned when NUMA is enabled and available; I do not know, and I would love to better understand how NUMA fits into this project.

@janvorli
Member

As for NUMA, it seems we could prune the CPUs reported in the bitmasks by libnuma using the thread affinity mask before we parse those bitmasks. I believe it would work fine. I say "I believe" since we cannot match it to Windows behavior: Windows doesn't seem to have a way to limit a process to run on only a subset of CPUs when the process uses SetThreadIdealProcessorEx.

@tmds
Member

tmds commented Mar 21, 2019

I am not a container platform expert, but I occasionally do some things with OpenShift/Kubernetes, and I have only seen settings that control the CPU share (like Docker's --cpus). I haven't seen settings that restrict the number of CPUs (cf. Docker's --cpuset-cpus).

Since there is no Environment.ProcessorQuota, any existing software uses Environment.ProcessorCount to scale depending on CPU.

So for a container platform user, it may be preferable to stick to the current implementation.

luhenry referenced this issue in luhenry/coreclr Mar 21, 2019
There are 2 Docker CLI command line options available that we are interested in here:
 - `--cpus`: limits the amount of CPU time available to the container (e.g. 1.8 means 180% CPU time, i.e. on 2 cores 90% of each core, on 4 cores 45% of each core, etc.)
 - `--cpuset-cpus`: limits the number of processors the container has access to; it also specifies which specific processors are available, but that's irrelevant here

All the runtime components depending on the number of processors available are:
 - ThreadPool
 - GC
 - `Environment.ProcessorCount` via `SystemNative::GetProcessorCount`
 - `SimpleRWLock::m_spinCount`
 - `BaseDomain::m_iNumberOfProcessors` (it's used to determine the GC heap to affinitize to)

All of the above components except `Environment.ProcessorCount` are aware of, and take advantage of, the values passed to `--cpus` and `--cpuset-cpus`.

**`--cpus`**

dotnet#12797 has already been done. It impacts all of the above runtime components, allowing performance to be optimized in a container/machine with limited resources, and makes sure the runtime components make the best use of what is available.

In the case of `Environment.ProcessorCount`, the behavior is such that passing `--cpus=1.5` on a machine with 8 processors will return `1`  as shown in https://github.com/dotnet/coreclr/issues/22302#issuecomment-459092299. This behavior is not consistent with [Windows Job Objects](https://docs.microsoft.com/en-us/windows/desktop/api/winnt/ns-winnt-jobobject_cpu_rate_control_information) which still returns the number of processors for the container/machine even if it only gets parts of the total number of cycles.

This behavior is erroneous because the container still has access to the full range of processors on the machine, and only its _processor time_ is limited. For example, in the case of a 4-processor machine with a value of `--cpus=1.8`, there can be 4 threads running in parallel even though each thread will only get `1.8 / 4 = 0.45` or 45% of the cycles of each processor.

The work consists of reverting the behavior of `SystemNative::GetProcessorCount` to pre dotnet#12797.

**`--cpuset-cpus`**

The work has been done here for all runtime components except `Environment.ProcessorCount`. The work consists of fixing `PAL_GetLogicalCpuCountFromOS` to use `sched_getaffinity`.

Fixes https://github.com/dotnet/coreclr/issues/22302
@luhenry
Contributor

luhenry commented Mar 22, 2019

luhenry referenced this issue in luhenry/coreclr Mar 22, 2019
This focuses on better supporting `--cpuset-cpus`, which limits the set of processors the container has access to; it also specifies which specific processors those are, but that's irrelevant here.

The work has been done here for all runtime components except `Environment.ProcessorCount`. The work consists of fixing `PAL_GetLogicalCpuCountFromOS` to use `sched_getaffinity`.

Fixes https://github.com/dotnet/coreclr/issues/22302
luhenry referenced this issue in luhenry/coreclr Mar 22, 2019
@jkotas
Member

jkotas commented Mar 22, 2019

dotnet/[email protected]:env-processorquota

Does this change actually make things always better?

We said in the earlier discussion that the processor quota really gives you a slower processor, but it does not reduce the level of parallelism available...

I think the underlying problem all this is trying to solve is: what is the right size of a pool or cache for a given component? The processor count works well for this on balanced machines that have a reasonable amount of memory per processor. Docker makes it easy to create environments that are imbalanced: a lot of (potential) processor power and not enough memory, or vice versa. Simplistic algorithms for computing the size of a pool or cache work poorly in these conditions. Maybe what we need here is an API specifically designed for building pools and caches that auto-adjust their sizes based on many factors.

@luhenry
Contributor

luhenry commented Mar 22, 2019

@jkotas it doesn't necessarily give you slower processors, but it reduces the budget you have across all the processors on the machine. I posted an example at dotnet/coreclr#23398 (comment) that illustrates some other cases, like spiky workloads.

A good example of a workload that benefits from this quota is a Parallel.ForEach that does not access any shared state and operates purely on its input (e.g. the map step of a map-reduce algorithm). In this case, we do not want to overcommit on the number of threads, as that only degrades performance. This goes hand-in-hand with dotnet/coreclr#23398; if we don't go with that PR, then Environment.ProcessorQuota is not required.

If we do not go with dotnet/coreclr#23398, we then have to figure out the best value to return in cases that could be rounded up. For example, should --cpus=1.5 round up to 2 processors (with 75% of each if 2 threads are very CPU intensive), or stay at 1 processor and expose ourselves to underutilization? I'll run some benchmarks with ASP.NET to figure out some high-level answers to that.

@VSadov
Member

VSadov commented Mar 22, 2019

Parallel.ForEach could be a good example.

Note that ThreadPool is not. One of the TP metrics is latency - how quickly it may dispatch bursts of tasks. Going to #procs will still be advantageous for that even if under a CPU quota.

Parallel.ForEach on the other hand may be assumed to care only about throughput, so knowing quota could be helpful for more optimal partitioning.

@janvorli
Member

For this discussion, I think it is also good to get a full understanding of how scheduling in the container works. There are a couple of docs I've found:

The kubernetes doc that @tmds has provided (https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/) says

spec.containers[].resources.limits.cpu is converted to its millicore value and multiplied by 100. The resulting value is the total amount of CPU time that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.

The cgroups doc at https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu#sect-cfs seems to be more specific on the scheduling behavior:

cpu.cfs_quota_us
specifies the total amount of time in microseconds (µs, represented here as "us") for which all tasks in a cgroup can run during one period (as defined by cpu.cfs_period_us). As soon as tasks in a cgroup use up all the time specified by the quota, they are throttled for the remainder of the time specified by the period and not allowed to run until the next period.

Even more detailed description of the CFS scheduling can be found here:
https://github.com/torvalds/linux/blob/master/Documentation/scheduler/sched-design-CFS.txt

@luhenry luhenry reopened this Mar 25, 2019
@luhenry
Contributor

luhenry commented Mar 25, 2019

I ran some ASP.NET Core benchmarks with --cpuset-cpus=1,2 and --cpus=1.999999999 or --cpus=2. I got the following results with the plaintext benchmark on a VM on Azure:

  1. with --cpuset-cpus=1,2 and --cpus=2

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 600,472 | 100 | 4.11 | 211 |
     | 601,953 | 100 | 4.11 | 279 |
     | 607,330 | 100 | 4.17 | 308 |
     | 603,674 | 100 | 4.27 | 216 |
     | 603,054 | 99 | 4.05 | 207 |

  2. with --cpuset-cpus=1,2 and --cpus=1.999999999

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 399,743 | 169 | 6.17 | 292 |
     | 405,412 | 160 | 6.2 | 236 |
     | 408,685 | 173 | 6.24 | 260 |
     | 492,366 | 196 | 5.07 | 233 |
     | 414,865 | 177 | 6.07 | 323 |

  3. with --cpuset-cpus=1,2 and --cpus=1.7

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 431,411 | 170 | 6.2 | 280 |
     | 418,328 | 156 | 6.1 | 322 |
     | 407,473 | 153 | 6.25 | 276 |
     | 433,062 | 156 | 5.89 | 210 |
     | 428,323 | 163 | 6.01 | 306 |

We can see a clear drop in performance between passing --cpus=1.999999999 and --cpus=2, while the performance of --cpus=1.999999999 and --cpus=1.7 is similar even though the total CPU budget was reduced by 15%.

My hypothesis is that the difference between 1.999999999 and 2 is negligible from the OS/scheduler perspective, and that the observed difference is explained by the different choices made by the runtime in terms of ThreadPool, GC, and others. To verify that hypothesis, I will run two locally compiled versions of coreclr, with and without dotnet/coreclr#23398, with --cpus=1.999999999. I'll post the results here.

[1] the Max CPU (%) should be between 0-100%, but in 2. and 3. it's over 100%. That's because the total time spent on CPU is divided by Environment.ProcessorCount, which in .NET Core 3.0 returns the rounded-down value passed to --cpus.

@tmds
Member

tmds commented Mar 25, 2019

@luhenry rounding up makes sense. If your container has 1.7 CPUs, it will run on more than 1 core. Your benchmark results confirm this.

luhenry referenced this issue in luhenry/coreclr Mar 27, 2019
luhenry referenced this issue in luhenry/coreclr Mar 27, 2019
@luhenry
Contributor

luhenry commented Mar 27, 2019

dotnet/coreclr#23398 fixes the situation for the ThreadPool and the GC when passing --cpus=1.999999999. It now simply rounds to the nearest integer.

I am now testing --cpus=1.4999999999 and --cpus=1.5 to verify there is no performance cliff between the two. If we observe a less-than-ideal drop, we should explore rounding up instead of rounding to the nearest integer.

@luhenry
Contributor

luhenry commented Mar 27, 2019

  1. with --cpuset-cpus=1,2 --cpus=1.5 + rounding to nearest integer

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 469,780 | 151 | 7.69 | 288 |
     | 479,875 | 151 | 7.47 | 284 |
     | 468,337 | 151 | 7.63 | 270 |
     | 479,624 | 151 | 7.58 | 203 |
     | 482,347 | 151 | 7.52 | 259 |

  2. with --cpuset-cpus=1,2 --cpus=1.499999999 + rounding to nearest integer

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 431,592 | 148 | 6.53 | 265 |
     | 419,993 | 145 | 6.29 | 237 |
     | 405,770 | 143 | 6.62 | 248 |
     | 414,939 | 140 | 6.26 | 257 |
     | 414,848 | 148 | 6.61 | 285 |

  3. with --cpuset-cpus=1,2 --cpus=1.499999999 + rounding up

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 417,095 | 150 | 6.62 | 191 |
     | 428,464 | 141 | 6.48 | 246 |
     | 428,544 | 147 | 6.35 | 196 |
     | 423,698 | 148 | 6.41 | 272 |
     | 436,548 | 145 | 6.35 | 220 |

  4. with --cpuset-cpus=1,2 --cpus=1.000000001 + rounding up

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 320,702 | 101 | 11.69 | 288 |
     | 310,606 | 100 | 12.91 | 262 |
     | 317,907 | 101 | 13.56 | 309 |
     | 316,390 | 102 | 12.88 | 195 |
     | 319,566 | 104 | 12.09 | 272 |

  5. with --cpuset-cpus=1,2 --cpus=1 + rounding up

     | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
     | --- | --- | --- | --- |
     | 316,496 | 102 | 12.63 | 282 |
     | 316,616 | 103 | 13.19 | 411 |
     | 317,123 | 102 | 12.57 | 208 |
     | 315,176 | 101 | 13.5 | 203 |
     | 317,981 | 102 | 12.7 | 261 |

We can observe in the --cpus=1.499999999 case that rounding up gives slightly better results, and that --cpus=1.000000001 and --cpus=1 are equivalent in terms of performance.

Based on that, I propose to go with rounding up in order to maximize the use of the available CPU.

@janvorli @jkotas @sergiy-k

luhenry referenced this issue in luhenry/coreclr Mar 27, 2019
@VSadov
Member

VSadov commented Mar 27, 2019

In the past when I needed to measure changes in ThreadPool latency, I was pointed to the following snippet:
dotnet/coreclr#13670 (comment)

Not sure if the benchmark exists in some other form.
That could be used to measure TP performance under various conditions, just run with:

    TaskBurstWorkThroughput 1PcT 000.5PcWi
    TaskBurstWorkThroughput 1PcT 001.0PcWi
    TaskBurstWorkThroughput 1PcT 004.0PcWi
    TaskBurstWorkThroughput 1PcT 016.0PcWi
    TaskBurstWorkThroughput 1PcT 064.0PcWi
    TaskBurstWorkThroughput 1PcT 256.0PcWi
    TaskSustainedWorkThroughput 1PcT
    ThreadPoolBurstWorkThroughput 1PcT 000.5PcWi
    ThreadPoolBurstWorkThroughput 1PcT 001.0PcWi
    ThreadPoolBurstWorkThroughput 1PcT 004.0PcWi
    ThreadPoolBurstWorkThroughput 1PcT 016.0PcWi
    ThreadPoolBurstWorkThroughput 1PcT 064.0PcWi
    ThreadPoolBurstWorkThroughput 1PcT 256.0PcWi
    ThreadPoolSustainedWorkThroughput 1PcT

I think that the initial worker limit should not change and should just be the physical number of cores (i.e. the level of parallelism available).
However, the benchmark may show something different...

Where CPU quota could be more interesting is in GetCPUBusyTime_NT.

That is used to detect situations where CPU is lightly loaded and we are not making progress with tasks (which together indicates that workers are blocked), then we allow more workers as a progress guarantee/deadlock prevention measure.

The part where we reason about "CPU is lightly loaded" could give a biased reading when quotas are involved, since it currently expects that all cores in the affinity mask can do 100%.
Under a quota we may misdetect a "lightly loaded" situation and inject workers when no more CPU power is actually available.

CC: @kouvel

@VSadov
Member

VSadov commented Mar 27, 2019

Note - there is also PAL_GetCPUBusyTime - the non-NT version, which is perhaps more interesting in a container.

luhenry referenced this issue in luhenry/coreclr Mar 28, 2019
This focuses on better supporting Docker CLI's parameter `--cpus`, which limits the amount of CPU time available to the container (ex: 1.8 means 180% CPU time, ie on 2 cores 90% for each core, on 4 cores 45% on each core, etc.)

All the runtime components depending on the number of processors available are:
 - ThreadPool
 - GC
 - `Environment.ProcessorCount` via `SystemNative::GetProcessorCount`
 - `SimpleRWLock::m_spinCount`
 - `BaseDomain::m_iNumberOfProcessors` (it's used to determine the GC heap to affinitize to)

All the above components take advantage of `--cpus` via `CGroup::GetCpuLimit` with dotnet#12797, allowing the runtime to optimize performance in a container/machine with limited resources. This makes sure the runtime components make the best use of available resources.

In the case of `Environment.ProcessorCount`, the behavior is such that passing `--cpus=1.5` on a machine with 8 processors will return `1`, as shown in https://github.com/dotnet/coreclr/issues/22302#issuecomment-459092299. This behavior is not consistent with [Windows Job Objects](https://docs.microsoft.com/en-us/windows/desktop/api/winnt/ns-winnt-jobobject_cpu_rate_control_information), which still report the number of processors for the container/machine even if it only gets part of the total number of cycles.

This behavior is erroneous because the container still has access to the full range of processors on the machine; only its _processor time_ is limited. For example, in the case of a 4-processor machine with a value of `--cpus=1.8`, there can be 4 threads running in parallel even though each thread will only get `1.8 / 4 = 0.45`, or 45%, of each processor's cycles.

The work consists of reverting the behavior of `SystemNative::GetProcessorCount` to pre-dotnet#12797.
luhenry referenced this issue in luhenry/coreclr Mar 29, 2019
luhenry referenced this issue in luhenry/coreclr Apr 2, 2019
luhenry referenced this issue in luhenry/coreclr Apr 5, 2019
@luhenry luhenry closed this as completed Apr 12, 2019
@mrmartan

mrmartan commented Jul 25, 2019

As @janvorli said:

For this discussion, I think it is also good to get full understanding on how the scheduling in the container works.

Whatever you did is IMHO not sufficient.
Your discussion here turns around two docker arguments. It in fact runs deeper, to the way those settings are propagated to the OS kernel. And there are more docker options than those two (and I suppose docker can change the way they map those onto the OS). I know Kubernetes uses different docker options than those discussed here as well. See: https://medium.com/@betz.mark/understanding-resource-limits-in-kubernetes-cpu-time-9eff74d3161b

I could give you more detailed info if you'd like. First, a description of our use case: ASP.NET Core APIs running in Kubernetes.

tl;dr The way .NET currently works, we are forced to run our docker containers without specifying any limits.

We are running one of our APIs in 14 instances (docker containers spread across our Kubernetes cluster; the workers are CentOS based). The base image used is mcr.microsoft.com/dotnet/core/aspnet:2.2.6. When no CPU limits are specified it serves around 1k HTTP requests per second, consuming circa 350m of CPU per instance (kube pod). The total number of processors reported by .NET is 224 (16 per instance, since each Kubernetes worker is a VM with 16 logical CPUs). Setting the kube CPU limit anywhere near our real usage causes .NET to report 14 CPUs (one per instance, i.e. each dotnet process behaves as if running on a single-core machine) and makes our system unable to handle its load.

To cope we have to increase scaling about three times (to 42 instances), which obviously consumes way more resources than required (memory especially). Setting the limit higher than necessary does not protect us from overconsuming resources while still limiting concurrency (whatever having e.g. Environment.ProcessorCount == 2 means for concurrency, since I assume that from the OS perspective the dotnet process threads are not limited to 2 cores; as far as I can tell Kubernetes uses neither the --cpuset-cpus docker option nor --cpus).

I have also tried .NET Core 3 Preview 7 and it behaves the same on Kubernetes as 2.2.6.

@janvorli
Member

While the discussion was about two docker arguments, the coreclr runtime reads the settings from cgroups (cpu.cfs_quota_us and cpu.cfs_period_us) and from the process affinity, not from docker. So it is independent of how the limits were specified.
There was a discussion above on what to report in Environment.ProcessorCount on a system that has a limit specified by these cgroups settings: whether to report the number of processor cores the .NET process can run on, or cpu.cfs_quota_us / cpu.cfs_period_us. The final consensus, also based on performance results of some asp.net benchmarks and other measurements, was to make it report cpu.cfs_quota_us / cpu.cfs_period_us.

I would like to understand why setting the CPU limit near your real CPU usage causes the performance issues. I have a couple of questions:

  • Does your app explicitly make decisions based on Environment.ProcessorCount or it is just due to what ASP.NET does?
  • What are the CPU limits you were setting?
  • How large is the perf degradation you are seeing?
  • What happens perf-wise if you double the CPU limits?

@mrmartan

mrmartan commented Jul 29, 2019

Does your app explicitly make decisions based on Environment.ProcessorCount or it is just due to what ASP.NET does?

No, it does not.

What are the CPU limits you were setting?

600m

How large is the perf degradation you are seeing?

We have to increase scaling about 3 times to handle the same load. I can't experiment with our Production environment as much as I would like :)

I should probably add that the app is heavy on I/O, executing dozens of async network I/O operations per incoming request.

What happens perf-wise if you double the CPU limits?

That's what I intend to test. To set the limits so that Environment.ProcessorCount > 1. Mainly because I read that when Environment.ProcessorCount == 1 CoreCLR forces GC to run in workstation mode.

@mrmartan

@janvorli I did not mean to suggest that this is implemented through some docker integration. I can see that it is done by means of kernel cgroups/CFS. I can also see from the discussion that it is a fairly complex issue.

Still, setting a CPU quota is not intended to limit concurrency (at least not in Kubernetes), and the way CoreCLR tunes itself is not appropriate in all use cases.
I have found a similar issue in Java; Java 10 uses the same approach as CoreCLR here. They did expose a knob to let the user override the available CPU count. So unless Environment.ProcessorCount changes to take its value from the cgroups cpuset (and I don't suppose it will), it would be useful for the developer to be able to override Environment.ProcessorCount (or whatever is appropriate for .NET, as not everything CPU related is driven by Environment.ProcessorCount).

@mrmartan

mrmartan commented Aug 7, 2019

Just to follow up on this if anyone would feel like addressing it. I feel like the current state of .NET is not suitable for Kubernetes deployments.

Following is the result of changing kube pod CPU limit from unlimited (on 16 CPU kube workers) to 2000m (and only that change):

Average CPU consumption is 280m, which is down from the previous 350m.
Memory consumption dropped by two thirds.
Surprisingly, even our response time 90th percentile dropped by about 5%.

[image: resource consumption graphs]

The way .NET works forces us to leave the Kubernetes CPU limit at 2000m even though we do not require that much for this use case.

@jkotas
Member

jkotas commented Aug 7, 2019

@mrmartan Could you please create a new github issue describing the problem? Comments on closed PRs are unlikely to get the response from the right people.

@VSadov
Member

VSadov commented Aug 7, 2019

@mrmartan - I have created an issue https://github.com/dotnet/coreclr/issues/26053 to my best understanding of your problem.
Please take a look and comment on whether that does/does not cover your case.

@tmds
Member

tmds commented Aug 8, 2019

It would be good to do some perf tracing to understand why containers with low CPU allocations are not performing as well as expected.

Such low allocations are common on Kubernetes, so it may be worth doing continuous perf benchmarks.

cc @davidfowl @sebastienros @adamsitnik

@mrmartan

mrmartan commented Aug 8, 2019

I now see two issues here: dotnet/coreclr#26053, which @VSadov opened (I will try and add some details to it), and possibly the one @tmds is hinting at, i.e. whether low performance is expected with a low CPU allocation (single core; EDIT: or rather perceived by .NET as single core).

I have a screenshot from a test I performed on our test Kubernetes cluster (mostly the same configuration as discussed above). In the test there was only one replica of the app in question. With the kube CPU limit set to 700m the actual CPU consumption was about 700m, and the result is on the left side of the image. With the kube CPU limit at 2000m the actual CPU consumption was 750m, and the result is on the right side of the image.

The image is from SuperBenchmarker (unfortunately I cropped it without the scale ☹️ )
[image: SuperBenchmarker latency/RPS graph]
Pink is latency, green is RPS (requests per second; maximum was 650 RPS)

@tmds
Member

tmds commented Dec 10, 2019

In the test there was only one replica of the app in question. With kube CPU limit set to 700m the actual CPU consumption was about 700m and the result is on the left side of the image. With kube CPU limit at 2000m the actual CPU consumption was 750m and the result is on the right side of the image.

These numbers are hard to compare.
With kube CPU limit at 700m, the app is restricted by CPU.
With kube CPU limit at 2000m, the app is no longer restricted by CPU.

@mrmartan

I see your point. I no longer have the setup to test it again and post the results here, but at the time I could have given it 1000m and the result would have been the same (while the actual CPU consumption would not rise).

The case is that the application was not CPU limited. CoreCLR was crippled since it thought it was running on a single-core machine. On K8s/Linux a 700m CPU limit does not imply 1 CPU/thread, but CoreCLR behaved as though it did.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020