
Container memory stats include filesystem cache usage #280

Closed
mikeybtn opened this issue Jan 7, 2016 · 15 comments

Comments

@mikeybtn

mikeybtn commented Jan 7, 2016

On ECS, we noticed the MemoryUtilization graph for one service steadily growing, and began looking for a memory leak in that service.

On the VM, the graphed value was consistent with docker stats <container>. However, attaching to the container showed that the RSS/VSIZE of all running processes was stable over time, and less than the reported value. The investigation led us to moby/moby#10824 and its eventual resolution in docker-archive/libcontainer#518. In short, the usage figure seems to include the page cache.

It looks like aws-ecs-agent does not subtract the cache value when building/reporting container stats.

Would it make sense to report usage as CgroupStats.MemoryStats.Usage - CgroupStats.MemoryStats.Cache, to avoid confusing evictable memory with "real" usage? Or is there a technique we're neglecting that could avoid this situation altogether? Thanks!
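
For anyone who wants to sanity-check the numbers on an instance in the meantime, the same subtraction can be done directly against the cgroup files. This is a minimal sketch, assuming the cgroup v1 memory hierarchy is mounted at /cgroup/memory (as on the ECS-optimized AMIs); the container ID is a placeholder:

$ CID=<full-container-id>   # placeholder: substitute the full container ID (e.g. from docker ps --no-trunc)
$ usage=$(cat /cgroup/memory/docker/$CID/memory.usage_in_bytes)
$ cache=$(awk '$1 == "cache" {print $2; exit}' /cgroup/memory/docker/$CID/memory.stat)
$ echo $((usage - cache))   # approximate "real" usage in bytes, with page cache excluded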

@ankanm

ankanm commented Jan 8, 2016

+1

@aaithal
Contributor

aaithal commented Jan 13, 2016

@mikeybtn, thank you for reporting this issue. You are correct in pointing out that the ECS Agent does not subtract the memory cache value from usage while reporting memory stats. We also need to consider whether subtracting the cache value is the proper way of handling this, or whether we should expose a different metric for the ‘Cache’ memory stat (as was the case where docker-archive/libcontainer#518 fixed the behavior reported in docker-archive/libcontainer#506). We will get back to you when we have an update for this issue.

Thanks,
Anirudh

@mikeybtn
Author

@aaithal thanks for acknowledging! I can see how this might extend to a wider product issue, where you'd need to surface both values in AWS charts and metrics. I'd be happy to help however I can.

@tschutte

tschutte commented Feb 9, 2016

Thanks for identifying this issue. We also recently deployed a legacy Java app into ECS, and I was concerned to see the service reporting memory utilization at almost twice our JVM -Xmx setting. My own investigation also indicates that ECS is reporting memory cache as part of overall memory utilization.

My only concern at this point is that a container could be improperly stopped by ECS because its memory usage exceeds the container definition, when that usage is only due to file cache and not our application.

@tschutte

tschutte commented Feb 9, 2016

And now that I've done a little more reading to better understand what is really going on in Docker regarding memory usage, reporting, and limits, I realize my question above is moot: the ECS agent is not responsible for killing a container on OOM; the kernel's memory cgroup (OOM killer) is.
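
For what it's worth, a quick way to confirm after the fact whether the kernel OOM-killed a container (rather than anything the ECS agent did) is to inspect its state once it exits; the container name below is a placeholder:

$ docker inspect -f '{{.State.OOMKilled}}' <container>   # prints "true" if the kernel's OOM killer terminated it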

@marklieberman

+1

I have a Docker service that produces a lot of temporary files. ECS is killing it periodically even though our JVM metrics show memory usage is stable and well below the memory limit.

@aaithal
Contributor

aaithal commented Mar 28, 2016

@marklieberman, thank you for your feedback. Please note that the memory limit is a parameter set via the Docker daemon and is ultimately enforced by the kernel as a memory cgroup limit. This is also documented in the ECS documentation about task definitions.
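
To illustrate with hypothetical values: a memory setting of 256 MiB in the container definition is passed to Docker much like -m 256m on the command line, and ends up as the cgroup's hard limit on the instance (a sketch; the image, container name, and value are chosen only for illustration, and the path matches the ECS-optimized AMIs):

$ docker run -d --name memdemo -m 256m amazonlinux sleep 3600
$ cat /cgroup/memory/docker/$(docker inspect -f '{{.Id}}' memdemo)/memory.limit_in_bytes
268435456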

@hammerdr

Just going to +1 this issue. It's important for us so that we can size our containers to fit appropriately. Specifically, this prevents us from running a low-memory process with low memory limits, because we can't tell the difference between process memory and cache (we could build something that does this for us, but we'll just make do with higher memory limits for now).
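
Until then, the split is at least visible per container in the cgroup's memory.stat file (a sketch; the container ID is a placeholder and /cgroup/memory is the mount point used on the ECS-optimized AMIs):

$ grep -E '^(rss|cache) ' /cgroup/memory/docker/<full-container-id>/memory.stat   # "rss" is process memory, "cache" is page cache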

@aaithal
Contributor

aaithal commented Nov 23, 2016

Hi All,

I wanted to provide an update on this with respect to the changes proposed in #582. Merging it would mean that the ECS Memory Utilization metrics for Cluster and Service would exclude the number of bytes of page cache memory. I also wanted to share the results of the experiments I ran to confirm that this doesn't impact the OOM behavior of containers.

I ran a bash command to consume as much cache as possible by simply reading all
files in the /usr/bin directory. Limiting this container's memory consumption
to 10m ("Test 1") results in this container using up to 10m memory in total.
But, its actual usage when we subtract the page cache turns out to be 0.5m.
This would show up in the console as 5.15625%, instead of 100% if this
were the only container running in the task.

Now, if the container is relaunched with a memory limit of 5m ("Test 2"), it
still consumes 5m in total, but the actual utilization continues to be at
0.5m. This would show up in the console as 10.15625%, instead of 100%
if this were the only container running in the task.

To summarize, merging this PR would result in a drop in the "Memory Utilization" metric, which would then reflect the actual memory usage of containers, but it shouldn't impact the OOM behavior of your containers/tasks. There would also be a discrepancy between what's shown in the CloudWatch console and docker stats, since docker stats continues to include page cache in its usage figure.

I have also verified this behavior across both generations of AMIs that support
ECS Telemetry ("Setup 1" and "Setup 2").

Please let me know if you have any concerns regarding the same.

Test 1: Run container with 10m memory
$ docker run -d --name cache -m 10m 137112412989.dkr.ecr.us-west-2.amazonaws.com/amazonlinux:latest sh -c "while [ true ]; do for i in \`ls /usr/bin\`; do cat /usr/bin/\$i; done; done"
53b6870c4b99fd1659d331ddfe7ab814a75eeb9a4fd21128254b9e1569ed0801
$ docker stats  --no-stream cache
CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
cache               9.81%               10.49 MB / 10.49 MB   100.00%             578 B / 578 B       105.8 MB / 0 B      0
$ cat /cgroup/memory/docker/53b6870c4b99fd1659d331ddfe7ab814a75eeb9a4fd21128254b9e1569ed0801/memory.stat | head -n 1
cache 9895936
$ cat /cgroup/memory/docker/53b6870c4b99fd1659d331ddfe7ab814a75eeb9a4fd21128254b9e1569ed0801/memory.usage_in_bytes
10436608
$ python -c "print (10436608-9895936)*100/(5*1024*1024.0)"
5.15625
Test 2: Run container with 5m memory
$ docker run -d --name cache -m 5m 137112412989.dkr.ecr.us-west-2.amazonaws.com/amazonlinux:latest sh -c "while [ true ]; do for i in \`ls /usr/bin\`; do cat /usr/bin/\$i; done; done"
184f9eec18b56d2a7de38e2077d8371d142db9ef028e8d6472a413587df11073
$ docker stats --no-stream cache
CONTAINER           CPU %               MEM USAGE / LIMIT    MEM %               NET I/O             BLOCK I/O           PIDS
cache               9.31%               5.21 MB / 5.243 MB   99.38%              648 B / 648 B       369.1 MB / 0 B      0
$ cat /cgroup/memory/docker/184f9eec18b56d2a7de38e2077d8371d142db9ef028e8d6472a413587df11073/memory.stat | head -n 1
cache 4653056
$ cat /cgroup/memory/docker/184f9eec18b56d2a7de38e2077d8371d142db9ef028e8d6472a413587df11073/memory.usage_in_bytes
5185536
$ python -c "print (5185536-4653056)*100/(5*1024*1024.0)"
10.15625
Setup 1: The amzn-ami-2016.09.b-amazon-ecs-optimized AMI
  • Docker Version:
$ docker version
Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.3
 Git commit:   b9f10c9/1.11.2
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.3
 Git commit:   b9f10c9/1.11.2
 Built:
 OS/Arch:      linux/amd64
  • Kernel Version:
$ uname -a
Linux ip-172-31-0-170 4.4.30-32.54.amzn1.x86_64 #1 SMP Thu Nov 10 15:52:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • ECS Agent Version:
$ curl --silent localhost:51678/v1/metadata | jq -r '.Version'
Amazon ECS Agent - v1.13.1 (efe53c6)
Setup 2: The amzn-ami-2015.09.a-amazon-ecs-optimized AMI
  • Docker Version:
$ docker version
Client version: 1.7.1
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 786b29d/1.7.1
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d/1.7.1
OS/Arch (server): linux/amd64
  • Kernel Version:
$ uname -a
Linux ip-172-31-12-210 4.1.7-15.23.amzn1.x86_64 #1 SMP Mon Sep 14 23:20:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
  • ECS Agent Version:
$ curl --silent localhost:51678/v1/metadata | jq -r '.Version'
Amazon ECS Agent - v1.5.0 (b197edd)

@ZhbL

ZhbL commented Dec 1, 2016

+1

@aaithal aaithal added this to the 1.14.0 milestone Dec 7, 2016
@samuelkarp
Contributor

Released

@ibrahima

ibrahima commented Feb 2, 2017

If I understand correctly, even after #582 the filesystem cache is still counted toward the limit when deciding whether to OOM-kill a process in a container? Just wanted to understand what was going on and what the expected behavior actually is.

@samuelkarp
Contributor

@ibrahima The file cache should not affect OOM behavior as the kernel should be able to evict from the file cache to make more memory available to the processes within the container.

@ibrahima

ibrahima commented Feb 2, 2017

Whoa, thanks for the fast response! I'm a bit confused, because watching top while my main process runs suggests it's getting killed while plenty of memory is still in use for cache at that point, but it may be due to something else.

I should add that we're currently on a very old AMI, planning to upgrade soon but just haven't had the chance.
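
In case it helps narrow it down, the kernel logs every OOM kill, so checking the instance's kernel log is a quick way to tell whether the OOM killer was involved (a sketch; the exact log wording varies by kernel version):

$ dmesg | grep -i -E 'out of memory|killed process'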

@timchenxiaoyu

I don't think so.
[image attachment]
docker run -m limits all memory usage, including the cache.
