Support Unified Cgroups (cgroups v2) #3127
Conversation
Force-pushed from c851e59 to 13a56a9
Force-pushed from 0288bd3 to 123c4e8
@@ -39,9 +39,8 @@ func dockerStatsToContainerStats(dockerStats *types.StatsJSON) (*ContainerStats,
}

func validateDockerStats(dockerStats *types.StatsJSON) error {
	// The length of PercpuUsage represents the number of cores in an instance.
	if len(dockerStats.CPUStats.CPUUsage.PercpuUsage) == 0 || numCores == uint64(0) {
I couldn't quite figure out why we decided to check len(dockerStats.CPUStats.CPUUsage.PercpuUsage) == 0 in the first place, but it looks like we don't reference this array anywhere else, and using runtime.NumCPU is good enough for me.
So, we think that len(dockerStats.CPUStats.CPUUsage.PercpuUsage) == 0 was redundant or a mistake? I'm hesitant to remove something when we don't have clarity on why it was introduced.
PercpuUsage doesn't appear to be used anywhere in the agent, and it's no longer available with cgroupv2. I'm not sure how else we can check that it's safe to remove. FWIW, there isn't really any alternative to removing the check, from what I can tell, since cgroupv2 stats simply don't provide this level of granularity.
If you would prefer, we could add an if-statement here and continue checking this array when we're using cgroupv1?
I'm guessing the intention was to do a simple sanity check of the stats API.
There is an alternative to the PercpuUsage field, online_cpus, which, according to the docker API doc, should be available whether v1 or v2 cgroups is used:
number_cpus = length(cpu_stats.cpu_usage.percpu_usage) or cpu_stats.online_cpus
But then I also saw this comment:
If either precpu_stats.online_cpus or cpu_stats.online_cpus is nil then for compatibility with older daemons the length of the corresponding cpu_usage.percpu_usage array should be used.
The earliest docker API reference I could find is for 1.18, and it does not contain the online_cpus field.
Since the docker API versions supported by the ECS agent go back to 1.17, which can still work with the 20.10 docker engine (based on this doc), and 20.10 is when docker introduced cgroupv2 support, it's not 100% safe for us to check online_cpus when cgroupv2 is used.
We can still keep the v1 check, but I'm not sure of a safe way to validate the docker stats numCPU when using cgroupv2.
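For illustration only, here is a minimal sketch of the version-gated check floated above: keep the PercpuUsage length check on cgroupv1 and fall back to OnlineCPUs / runtime.NumCPU on cgroupv2. The function name, the cgroupV2 parameter, and the error messages are assumptions for the sketch, not the code that was merged.

package stats

import (
	"fmt"
	"runtime"

	"github.com/docker/docker/api/types"
)

// validateDockerStatsSketch is an illustrative, version-aware variant of the
// check being discussed. cgroupV2 is assumed to be plumbed in from the
// agent's cgroup-version detection.
func validateDockerStatsSketch(dockerStats *types.StatsJSON, cgroupV2 bool) error {
	if dockerStats == nil {
		return fmt.Errorf("validate docker stats: nil stats")
	}
	if !cgroupV2 {
		// cgroupv1: PercpuUsage is populated and its length reflects the
		// number of cores, so the original sanity check still applies.
		if len(dockerStats.CPUStats.CPUUsage.PercpuUsage) == 0 {
			return fmt.Errorf("validate docker stats: empty PercpuUsage")
		}
		return nil
	}
	// cgroupv2: per-cpu usage is not reported, so derive the core count from
	// OnlineCPUs (newer daemons) and fall back to the host's runtime.NumCPU,
	// mirroring the docker API doc comment quoted above.
	numCores := uint64(dockerStats.CPUStats.OnlineCPUs)
	if numCores == 0 {
		numCores = uint64(runtime.NumCPU())
	}
	if numCores == 0 {
		return fmt.Errorf("validate docker stats: unable to determine number of CPUs")
	}
	return nil
}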
Force-pushed from 1db1587 to b7ebe56
closes aws/containers-roadmap#1535, closes aws#3117
This adds support for task-level resource limits when running on unified cgroups (aka cgroups v2) with the systemd cgroup driver. Cgroups v2 has introduced a cgroups format that is not backward compatible with cgroups v1. In order to support both v1 and v2, we have added a config variable to detect which cgroup version the ecs agent is running with. The containerd/cgroups library is used to determine which mode it is using on agent startup. Cgroups v2 no longer can provide per-cpu usage stats, so this validation was removed since we never used it either.
if err != nil {
	return fmt.Errorf("cgroup resource [%s]: setup cgroup: unable to set use hierarchy flag: %w", cgroup.taskARN, err)
if !config.CgroupV2 {
	// enabling cgroup memory hierarchy by doing 'echo 1 > memory.use_hierarchy'
I was looking into this memory.use_hierarchy flag; some information I found in the kernel doc (see 6. Hierarchy support). So this basically enables recursive accounting and reclaiming, but I couldn't find the cgroupv2 equivalent (if any). Were we able to verify that container memory consumption gets charged to the task cgroup as well?
My understanding was that since cgroups v2 is by default a "unified hierarchy", this flag no longer applies. From my testing this seems to be the case because limits set on a task slice with two containers are applied to both containers in aggregate.
I think applying this flag in v1 will charge sub-cgroup memory usage to its parent (bottom-up). The allocation (top-down) is not affected with or without the flag.
This is my understanding: if a task cgroup has two containers, the task has a hard limit of 100MB, and each container gets 50MB. Now one of the two containers attempts to allocate 200MB:
- without the flag, only the container will be killed
- with the flag, because the memory allocation got charged to the task, the task will also be oom-killed.
Can we verify whether that's the behavior?
And to be honest, I'm not entirely sure why we want to enable this hierarchical charging. If a non-essential container gets killed, we should still let the task run...
if a task cgroup has two containers, task has hard limit 100MB with each container getting 50MB
I can test this, but I'm fairly certain that both containers share a limit of 100MB. So if one container is using only 20MB, the other can use up to 80MB.
I tested the scenario of a task with 512MB of total memory, with one container allocating 100MB and another 450MB. This does seem to be correct, i.e.:
without the flag, only the container will be killed
On AL2022 only the container allocating 450MB is OOM-killed. I'm not sure exactly how it decides which container to kill; it might just kill whichever container breaches the memory limit first.
with the flag, because the memory allocation got charged to the task, the task will also be oom-killed.
Yes, I confirmed that on AL2, when one container hits the limit, BOTH containers are killed by the OOM killer.
Will investigate how we can turn on use_hierarchy on cgroupv2...
Will investigate how we can turn on use_hierarchy on cgroupv2...
I would actually suggest that we turn off use_hierarchy on cgroupv1.
If I understand it correctly, the only time one container's failure should bring down the whole task is when that container is the essential container. On the other hand, if a container is non-essential, its being OOM-killed (or terminated in any other way, for that matter) should not affect the other containers in the task.
To follow up on this: use_hierarchy functionality is not completely available on cgroup v2. The memory.oom.group flag comes close (see https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html), but it causes the entire container to be killed rather than just the offending process within the container. We will go forward with the kernel default behavior of only killing the "bulkiest" process when the task-level memory limit is reached.
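For reference only (this is not what the PR adopted), a minimal sketch of how memory.oom.group could be toggled on a cgroup v2 slice; the helper name and the example path are hypothetical, following the slice layout described elsewhere in this PR.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// enableOOMGroup writes "1" to memory.oom.group so the OOM killer treats the
// cgroup as a single unit and kills every process in it together.
func enableOOMGroup(cgroupPath string) error {
	target := filepath.Join(cgroupPath, "memory.oom.group")
	if err := os.WriteFile(target, []byte("1"), 0644); err != nil {
		return fmt.Errorf("enable memory.oom.group for %s: %w", cgroupPath, err)
	}
	return nil
}

func main() {
	// Hypothetical task slice path used only for illustration.
	if err := enableOOMGroup("/sys/fs/cgroup/ecstasks.slice/ecstasks-example.slice"); err != nil {
		fmt.Println(err)
	}
}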
Force-pushed from 264705f to 438de7e
* Support Unified Cgroups (cgroups v2): closes aws/containers-roadmap#1535, closes aws#3117. This adds support for task-level resource limits when running on unified cgroups (aka cgroups v2) with the systemd cgroup driver. Cgroups v2 has introduced a cgroups format that is not backward compatible with cgroups v1. In order to support both v1 and v2, we have added a config variable to detect which cgroup version the ecs agent is running with. The containerd/cgroups library is used to determine which mode it is using on agent startup. Cgroups v2 no longer can provide per-cpu usage stats, so this validation was removed since we never used it either.
* wip
* update cgroups library with nil panic bugfix
* Initialize and toggle cgroup controllers
Summary
Implements part of aws/containers-roadmap#1535
This adds support for task-level resource limits when running on unified cgroups (aka cgroups v2) with the systemd cgroup driver.
Cgroups v2 has introduced a cgroups format that is not backward compatible with cgroups v1. In order to support both v1 and v2, we have added a config variable to detect which cgroup version the ecs agent is running with. The containerd/cgroups library is used to determine which mode the host is using on agent startup.
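As a rough illustration of that startup detection (not the agent's exact code; the package and variable names here are assumptions), the check with containerd/cgroups boils down to something like:

package config

import "github.com/containerd/cgroups"

// CgroupV2 is an illustrative flag recording whether the host booted with the
// unified (v2) cgroup hierarchy; the agent's real field name may differ.
var CgroupV2 bool

// detectCgroupVersion is expected to run once at agent startup.
func detectCgroupVersion() {
	// cgroups.Mode() inspects /sys/fs/cgroup and reports Legacy, Hybrid,
	// Unified, or Unavailable.
	CgroupV2 = cgroups.Mode() == cgroups.Unified
}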
When running with cgroups v2, docker containers are normally created under the path /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope. When using task resource limits, the task receives its own cgroup slice, and the containers are created under the path /sys/fs/cgroup/ecstasks.slice/ecstasks-${TASK_ID}.slice/docker-${CONTAINER_ID}.scope. The ecstasks-${TASK_ID}.slice slice is created with the resource limits specified in the task definition, and every docker container under it is subject to these limits. For a good doc and overview of how these cgroup hierarchies now work with the systemd driver, see https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/resource_management_guide/sec-default_cgroup_hierarchies
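To make that layout concrete, here is a small illustrative helper (the function and names are assumptions, not agent code) that builds the expected container cgroup path in both cases:

package main

import (
	"fmt"
	"path/filepath"
)

// containerCgroupPath returns where a container's cgroup is expected to live
// under cgroup v2 with the systemd driver. taskID is empty when the task has
// no task-level resource limits.
func containerCgroupPath(taskID, containerID string) string {
	scope := fmt.Sprintf("docker-%s.scope", containerID)
	if taskID == "" {
		// Without task limits, docker places the container under system.slice.
		return filepath.Join("/sys/fs/cgroup", "system.slice", scope)
	}
	// With task limits, the container lives under the task's slice, which
	// carries the task-level cpu/memory limits from the task definition.
	return filepath.Join("/sys/fs/cgroup", "ecstasks.slice",
		fmt.Sprintf("ecstasks-%s.slice", taskID), scope)
}

func main() {
	fmt.Println(containerCgroupPath("", "abc123"))        // system.slice path
	fmt.Println(containerCgroupPath("task-id", "abc123")) // task slice path
}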
Because docker uses the systemd cgroup driver by default with cgroup v2, resource limit actions are no longer logged by the agent or docker. OOM-kill messages can now be found in the system journal (journalctl --system) or /var/log/messages.
Testing
New tests cover the changes: yes
Functional tests run to verify that v1 behavior is unchanged.
Manually tested running on a cgroup v2 system and verified that task-level mem/cpu limits are applied.
Ran the full functional testing suite on a cgroupv2 system (AL2022) and verified all tests pass.
Description for the changelog
Enhancement - Support for unified cgroups and the systemd cgroup driver.
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.