Incorrect memory utilization under cgroup2 on Amazon Linux 2022 #3323

Closed
ltm opened this issue Jul 29, 2022 · 3 comments
Labels
kind/tracking This issue is being tracked internally

Comments

ltm commented Jul 29, 2022

Summary

The memory utilization metric reported to CloudWatch doesn't match the metric reported by the Docker CLI when running ECS containers on Amazon Linux 2022. This is a regression of #280.

Description

Amazon Linux 2022 now uses cgroup2, and as such the memory stats reported by Docker have changed from statsV1 to statsV2. Notably, the statsV2 memory stats no longer include the cache property.

Since #582, the memory utilization reported to CloudWatch has been calculated as (memory_stats.usage - memory_stats.stats.cache) / memory_stats.limit, which matched the Docker CLI at the time. However, with the cache property missing from the statsV2 memory stats, this calculation is no longer accurate.

The Docker CLI currently calculates the memory utilization as (memory_stats.usage - memory_stats.stats.total_inactive_file) / memory_stats.limit under cgroup1 and as (memory_stats.usage - memory_stats.stats.inactive_file) / memory_stats.limit under cgroup2.
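For reference, a minimal Go sketch of that calculation (an illustration only, built from the field names described above; not the actual Docker CLI source):

package stats

// memUtilization mirrors the calculation described above: subtract the
// inactive page cache from usage, preferring the cgroup1 field name and
// falling back to the cgroup2 one, then divide by the limit.
func memUtilization(usage, limit uint64, stats map[string]uint64) float64 {
	if v, ok := stats["total_inactive_file"]; ok && v < usage { // cgroup1
		usage -= v
	} else if v, ok := stats["inactive_file"]; ok && v < usage { // cgroup2
		usage -= v
	}
	return float64(usage) / float64(limit) * 100.0
}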

Expected Behavior

The CloudWatch memory utilization metric should match the metric reported by the Docker CLI.

Observed Behavior

A container with 20.97% memory utilization according to the Docker CLI is reported as 54.30% memory utilization by CloudWatch.

Environment Details

$ docker info
Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 3
  Running: 3
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.13
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cc61520f4cd876b86e77edfeb88fbcd536d1f9d
 runc version: f46b6ba2c9314cfc8caae24a32ec5fe9ef1059fe
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.43-20.123.amzn2022.x86_64
 Operating System: Amazon Linux 2022
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 7.472GiB
 Name: ip-10-2-128-116.us-west-2.compute.internal
 ID: 6PVI:LERT:ND5G:UZKQ:7GG2:TTOF:OKQM:QFJW:EDPN:PS6V:ON22:YAR6
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Supporting Log Snippets

Docker stats JSON:

{
  "memory_stats": {
    "usage": 3985747968,
    "stats": {
      "active_anon": 0,
      "active_file": 1489485824,
      "anon": 970752,
      "anon_thp": 0,
      "file": 3935731712,
      "file_dirty": 0,
      "file_mapped": 0,
      "file_writeback": 0,
      "inactive_anon": 970752,
      "inactive_file": 2446245888,
      "kernel_stack": 16384,
      "pgactivate": 367031,
      "pgdeactivate": 0,
      "pgfault": 8135005,
      "pglazyfree": 0,
      "pglazyfreed": 0,
      "pgmajfault": 18948,
      "pgrefill": 0,
      "pgscan": 0,
      "pgsteal": 0,
      "shmem": 0,
      "slab": 48933808,
      "slab_reclaimable": 48735888,
      "slab_unreclaimable": 197920,
      "sock": 0,
      "thp_collapse_alloc": 0,
      "thp_fault_alloc": 0,
      "unevictable": 0,
      "workingset_activate": 0,
      "workingset_nodereclaim": 0,
      "workingset_refault": 0
    },
    "limit": 7340032000
  }
}
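Plugging these values into the two formulas shows where the discrepancy comes from:

(3985747968 - 0)          / 7340032000 ≈ 54.30%   <- current agent: "cache" is absent, so nothing is subtracted
(3985747968 - 2446245888) / 7340032000 ≈ 20.97%   <- Docker CLI: subtracts "inactive_file"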
Realmonia (Contributor) commented:

Hi,

Thanks for reporting this! Looking into it.

fierlion added the kind/tracking (This issue is being tracked internally) label on Aug 25, 2022

sparrc commented Aug 29, 2022

I have instrumented the ecs-agent with logging to confirm that subtracting "inactive_file" instead of "cache" does in fact report a more accurate memory usage to CloudWatch, or at least one that more accurately reflects what the Docker CLI reports.

See the memory usage from the Docker CLI and the "FIXED" memory usage below, which uses inactive_file instead of cache:

on AL2022:

src/github.com/aws/amazon-ecs-agent % docker stats f6bf24ec4486 --no-stream
CONTAINER ID   NAME                                             CPU %     MEM USAGE / LIMIT   MEM %     NET I/O       BLOCK I/O     PIDS
f6bf24ec4486   ecs-tomcat-sample-4-main1-f4e2c581ecd7d49c9201   0.05%     51.85MiB / 256MiB   20.25%    1.14kB / 0B   0B / 2.03MB   30

src/github.com/aws/amazon-ecs-agent % tail -f /var/log/ecs/ecs-agent.log   
level=info time=2022-08-29T23:00:50Z msg="DEBUG USAGE: 51.8906 mb    FIXED: 51.8516" module=utils_unix.go
level=info time=2022-08-29T23:00:51Z msg="DEBUG USAGE: 51.8906 mb    FIXED: 51.8516" module=utils_unix.go
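A sketch of what the corrected usage calculation could look like (hypothetical code, assuming the Docker API types; the actual change lives in the agent's stats conversion, where the debug logs above were emitted from utils_unix.go):

package stats

import "github.com/docker/docker/api/types"

// adjustedMemUsage subtracts the page-cache portion of usage, handling
// both cgroup versions: "cache" exists under cgroupv1, while cgroupv2
// only exposes "inactive_file". This is a hypothetical sketch, not the
// actual ecs-agent patch.
func adjustedMemUsage(mem types.MemoryStats) uint64 {
	if cache, ok := mem.Stats["cache"]; ok && cache < mem.Usage { // cgroupv1
		return mem.Usage - cache
	}
	if inactive, ok := mem.Stats["inactive_file"]; ok && inactive < mem.Usage { // cgroupv2
		return mem.Usage - inactive
	}
	return mem.Usage
}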


sparrc commented Aug 29, 2022

FWIW, we also seem to have the same issue on AL2 (with respect to Docker CLI memory usage accuracy). @ltm, can you confirm whether you only see this issue on AL2022? Does your application perhaps create some "cache" memory that offsets the difference on AL2?

on AL2:

level=info time=2022-08-29T23:28:09Z msg="DEBUG PRE: 52.605      FIXED: 52.566" module=utils_unix.go
level=info time=2022-08-29T23:28:10Z msg="DEBUG PRE: 52.605      FIXED: 52.566" module=utils_unix.go
^C
CONTAINER INST src/github.com/aws/amazon-ecs-agent % docker stats fe5022acf01f --no-stream
CONTAINER ID   NAME                                             CPU %     MEM USAGE / LIMIT   MEM %     NET I/O     BLOCK I/O   PIDS
fe5022acf01f   ecs-tomcat-sample-4-main1-dceffee1c4abc3cedf01   0.05%     52.57MiB / 256MiB   20.54%    820B / 0B   0B / 0B     31

sparrc added a commit to sparrc/amazon-ecs-agent that referenced this issue Aug 30, 2022
"cache" memory stat no longer exists in cgroupv2.

docker cli subtracts "inactive_file" for the overall mem usage
calculation, so do the same for cgroupv2.

closes aws#3323
sparrc added a commit that referenced this issue Aug 30, 2022
"cache" memory stat no longer exists in cgroupv2.

docker cli subtracts "inactive_file" for the overall mem usage
calculation, so do the same for cgroupv2.

closes #3323