
Support Unified Cgroups (cgroups v2) #3127

Merged: 4 commits merged into feature/cgroupv2 on Mar 2, 2022

Conversation

@sparrc (Contributor) commented Feb 11, 2022

Summary

Implements part of aws/containers-roadmap#1535

This adds support for task-level resource limits when running on unified cgroups (aka cgroups v2) with the systemd cgroup driver.

Cgroups v2 has introduced a cgroups format that is not backward compatible with cgroups v1. In order to support both v1 and v2, we have added a config variable to detect which cgroup version the ECS agent is running with; the containerd/cgroups library is used to determine the mode at agent startup.
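
For reference, detecting the mode with containerd/cgroups looks roughly like the following minimal sketch (the function name and the standalone program around it are illustrative, not the agent's actual code):

package main

import (
	"fmt"

	"github.com/containerd/cgroups"
)

// detectCgroupV2 reports whether the host is running in pure unified
// (cgroups v2) mode, as opposed to legacy or hybrid mode.
func detectCgroupV2() bool {
	return cgroups.Mode() == cgroups.Unified
}

func main() {
	fmt.Println("cgroups v2:", detectCgroupV2())
}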

When running with cgroups v2, docker containers are normally created under the path /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope.

When using task resource limits, the task receives its own cgroup slice, and the containers are created under the path /sys/fs/cgroup/ecstasks.slice/ecstasks-${TASK_ID}.slice/docker-${CONTAINER_ID}.scope. The ecstasks-${TASK_ID}.slice slice is created with the resource limits specified in the task definition, and every docker container under it is subject to these limits.
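
As a rough illustration of how a container ends up under the task slice (a minimal sketch assuming the systemd cgroup driver is in use; the helper, the example task ID, and the standalone program are hypothetical, not the agent's actual code):

package main

import (
	"fmt"

	"github.com/docker/docker/api/types/container"
)

// taskSlice returns the systemd slice name to use as the docker cgroup parent
// for a task's containers. With the systemd driver this is a slice name, not a
// filesystem path; systemd materializes it as
// /sys/fs/cgroup/ecstasks.slice/ecstasks-${TASK_ID}.slice/docker-${CONTAINER_ID}.scope.
func taskSlice(taskID string) string {
	return fmt.Sprintf("ecstasks-%s.slice", taskID)
}

func main() {
	hostConfig := &container.HostConfig{CgroupParent: taskSlice("c0ffee")}
	fmt.Println(hostConfig.CgroupParent) // ecstasks-c0ffee.slice
}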

For a good doc and overview of how these cgroup hierarchies now work with the systemd driver, see https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/resource_management_guide/sec-default_cgroup_hierarchies

Because docker uses the systemd cgroup driver by default with cgroup v2, resource limit actions are no longer logged by the agent or docker. OOM-kill messages can now be found in the system journal (journalctl --system) or /var/log/messages.

Testing

New tests cover the changes: yes

Functional tests were run to verify that v1 behavior is unchanged.

Manually tested running on a cgroup v2 system and verified that task-level mem/cpu limits are applied.

Ran the full functional testing suite on a cgroupv2 system (AL2022) and verified all tests pass.

Description for the changelog

Enhancement - Support for unified cgroups and the systemd cgroup driver.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sparrc force-pushed the cgroupv2 branch 8 times, most recently from c851e59 to 13a56a9 on February 11, 2022 23:48
@sparrc changed the title from "wip" to "Support task-level resource limits on Unified Cgroups (cgroups v2)" on Feb 14, 2022
@sparrc changed the base branch from dev to feature/cgroupv2 on February 14, 2022 18:52
@sparrc force-pushed the cgroupv2 branch 3 times, most recently from 0288bd3 to 123c4e8 on February 15, 2022 18:19
@sparrc changed the title from "Support task-level resource limits on Unified Cgroups (cgroups v2)" to "Support Unified Cgroups (cgroups v2)" on Feb 15, 2022
agent/api/task/task_linux.go: 3 review threads (outdated, resolved)
@@ -39,9 +39,8 @@ func dockerStatsToContainerStats(dockerStats *types.StatsJSON) (*ContainerStats,
}

func validateDockerStats(dockerStats *types.StatsJSON) error {
// The length of PercpuUsage represents the number of cores in an instance.
if len(dockerStats.CPUStats.CPUUsage.PercpuUsage) == 0 || numCores == uint64(0) {

Contributor:

I couldn't quite figure out why we decided to check len(dockerStats.CPUStats.CPUUsage.PercpuUsage) == 0 in the first place, but it looks like we don't reference this array anywhere else, and using runtime.NumCPU is good enough for me.

Contributor:

So, we think that the len(dockerStats.CPUStats.CPUUsage.PercpuUsage) == 0 check was redundant or a mistake? I'm hesitant to remove something when we don't have clarity on why it was introduced.

@sparrc (author):

PercpuUsage doesn't appear to be used anywhere in the agent, and it's no longer available with cgroupv2. I'm not sure how else we can check that it's safe to remove; FWIW, from what I can tell there isn't really an alternative to removing the check, since cgroupv2 stats simply don't provide this level of granularity.

If you would prefer, we could keep an if-statement here and continue checking this array when we're using cgroupv1?
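
For illustration, a minimal sketch of that conditional, reusing the file's existing numCores variable and the config.CgroupV2 flag added in this PR (exactly how the flag is plumbed into this package is assumed here, not taken from the PR):

func validateDockerStats(dockerStats *types.StatsJSON) error {
	// cgroups v1 still reports a per-CPU breakdown, so keep the original sanity
	// check there; cgroups v2 stats omit PercpuUsage entirely.
	if !config.CgroupV2 && len(dockerStats.CPUStats.CPUUsage.PercpuUsage) == 0 {
		return fmt.Errorf("invalid container statistics reported, no per-CPU usage")
	}
	if numCores == uint64(0) {
		return fmt.Errorf("invalid number of cpu cores acquired from the host")
	}
	return nil
}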

Contributor:

I'm guessing the intention was to do a simple sanity check of the stats API.
There is an alternative to the PercpuUsage field, online_cpus, which according to the docker API docs should be available whether v1 or v2 cgroups is used:

number_cpus = length(cpu_stats.cpu_usage.percpu_usage) or cpu_stats.online_cpus

But then I also saw this comment:

If either precpu_stats.online_cpus or cpu_stats.online_cpus is nil then for compatibility with older daemons the length of the corresponding cpu_usage.percpu_usage array should be used.

The earliest docker API reference I could find is for 1.18, and it does not contain the online_cpus field.

Since the docker API versions supported by the ECS agent go back to 1.17, which can still work with the 20.10 docker engine (based on this doc), and 20.10 is when docker introduced cgroupv2 support, it's not 100% safe for us to check online_cpus when cgroupv2 is used.

We can still keep the v1 check, but I'm not sure of a safe way to validate the docker stats CPU count when using cgroupv2.

@sparrc force-pushed the cgroupv2 branch 2 times, most recently from 1db1587 to b7ebe56 on February 16, 2022 00:01
closes aws/containers-roadmap#1535
closes aws#3117

This adds support for task-level resource limits when running on unified
cgroups (aka cgroups v2) with the systemd cgroup driver.

Cgroups v2 has introduced a cgroups format that is not backward compatible
with cgroups v1. In order to support both v1 and v2, we have added a config
variable to detect which cgroup version the ecs agent is running with.
The containerd/cgroups library is used to determine which mode it is using
on agent startup.

Cgroups v2 no longer can provide per-cpu usage stats, so this validation
was removed since we never used it either.
agent/config/config_linux.go: review thread (resolved)
agent/config/config_unix.go: review thread (resolved)
if !config.CgroupV2 {
	// enabling cgroup memory hierarchy by doing 'echo 1 > memory.use_hierarchy'
	if err != nil {
		return fmt.Errorf("cgroup resource [%s]: setup cgroup: unable to set use hierarchy flag: %w", cgroup.taskARN, err)

Contributor:

I was looking into this memory.use_hierarchy flag; there is some information in the kernel docs (see 6. Hierarchy support). It basically enables recursive accounting and reclaiming, but I couldn't find a cgroupv2 equivalent (if any). Were we able to verify that container memory consumption gets charged to the task cgroup as well?

@sparrc (author) commented Feb 16, 2022:

My understanding was that since cgroups v2 is by default a "unified hierarchy", this flag no longer applies. From my testing this seems to be the case because limits set on a task slice with two containers are applied to both containers in aggregate.

@yinyic commented Feb 25, 2022:

I think applying this flag in v1 will charge sub-cgroup memory usage to its parent (bottom-up); the allocation (top-down) is not affected either way.
This is my understanding: say a task cgroup has two containers, the task has a hard limit of 100MB with each container getting 50MB, and one of the two containers attempts to allocate 200MB. Then:

  • without the flag, only the container will be killed
  • with the flag, because the memory allocation got charged to the task, the task will also be oom-killed.

Can we verify if that's the behavior?

And tbh, I'm not entirely sure why we want to enable this hierarchical charging. If a non-essential container gets killed, we should still let the task run...

@sparrc (author):

if a task cgroup has two containers, task has hard limit 100MB with each container getting 50MB

I can test this, but I'm fairly certain that both containers share a limit of 100MB, so if one container is using only 20MB the other can use up to 80MB.

@sparrc (author):

I tested the scenario of a task with 512MB of total memory, with one container allocating 100MB and another 450MB. This does seem to be correct, i.e....

without the flag, only the container will be killed

On AL2022, only the container allocating 450MB is OOM-killed. I'm not sure exactly how it decides which container to kill; it might just kill whichever container breaches the memory limit first.

with the flag, because the memory allocation got charged to the task, the task will also be oom-killed.

Yes, I confirmed that on AL2, when one container hits the limit, BOTH containers are killed by the OOM killer.

Will investigate how we can turn on use_hierarchy on cgroupv2...

@yinyic commented Feb 25, 2022:

Will investigate how we can turn on use_hierarchy on cgroupv2...

I would actually suggest that we turn off use_hierarchy on cgroupv1.

If I understand it correctly, the only time a container failure should bring down the whole task is when that container is the essential container. On the other hand, if a container is non-essential, its being OOM-killed (or terminated in any other way, for that matter) should not affect the other containers in the task.

@sparrc (author):

To follow up on this, use_hierarchy functionality is not completely available on cgroup v2. The memory.oom.group flag is close (see https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html), but it causes the entire container to be killed, rather than just the process within the container. We will go forward with the kernel default behavior of only killing the "bulkiest" process when the task-level memory limit is reached.
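
For reference, the memory.oom.group flag mentioned above is a per-cgroup file on v2; toggling it would look roughly like this sketch (not what this PR does, and the example path and helper name are illustrative):

package main

import (
	"os"
	"path/filepath"
)

// enableGroupOOMKill writes "1" to memory.oom.group in the given cgroup v2
// directory, making the OOM killer take down every process in that cgroup
// instead of only the bulkiest one.
func enableGroupOOMKill(cgroupDir string) error {
	return os.WriteFile(filepath.Join(cgroupDir, "memory.oom.group"), []byte("1"), 0o644)
}

func main() {
	// Example task slice path on a cgroups v2 host (illustrative).
	_ = enableGroupOOMKill("/sys/fs/cgroup/ecstasks.slice/ecstasks-c0ffee.slice")
}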

@sparrc force-pushed the cgroupv2 branch 2 times, most recently from 264705f to 438de7e on February 24, 2022 01:10
@sparrc merged commit d3e7150 into aws:feature/cgroupv2 on Mar 2, 2022
@sparrc deleted the cgroupv2 branch on March 2, 2022 19:23
sparrc added a commit to sparrc/amazon-ecs-agent that referenced this pull request Mar 30, 2022
* Support Unified Cgroups (cgroups v2)

closes aws/containers-roadmap#1535
closes aws#3117

This adds support for task-level resource limits when running on unified
cgroups (aka cgroups v2) with the systemd cgroup driver.

Cgroups v2 has introduced a cgroups format that is not backward compatible
with cgroups v1. In order to support both v1 and v2, we have added a config
variable to detect which cgroup version the ecs agent is running with.
The containerd/cgroups library is used to determine which mode it is using
on agent startup.

Cgroups v2 no longer can provide per-cpu usage stats, so this validation
was removed since we never used it either.

* wip

* update cgroups library with nil panic bugfix

* Initialize and toggle cgroup controllers
sparrc added a commit that referenced this pull request Apr 1, 2022
* Support Unified Cgroups (cgroups v2)

closes aws/containers-roadmap#1535
closes #3117

This adds support for task-level resource limits when running on unified
cgroups (aka cgroups v2) with the systemd cgroup driver.

Cgroups v2 has introduced a cgroups format that is not backward compatible
with cgroups v1. In order to support both v1 and v2, we have added a config
variable to detect which cgroup version the ecs agent is running with.
The containerd/cgroups library is used to determine which mode it is using
on agent startup.

Cgroups v2 no longer can provide per-cpu usage stats, so this validation
was removed since we never used it either.

* wip

* update cgroups library with nil panic bugfix

* Initialize and toggle cgroup controllers