
Container memory stats include filesystem cache usage #280

Closed
mikeybtn opened this issue Jan 7, 2016 · 15 comments

Comments

@mikeybtn

mikeybtn commented Jan 7, 2016

On ECS, we noticed the MemoryUtilization graph for one service steadily growing, and began looking for a memory leak in that service.

On the VM, the graphed value was consistent with docker stats <container>. However, attaching to the container showed that the RSS/VSIZE of all running processes was stable over time, and less than the reported value. The investigation led us to moby/moby#10824 and its eventual resolution in docker-archive/libcontainer#518. In short, the usage figure seems to include the page cache.

It looks like aws-ecs-agent does not subtract the cache value when building/reporting container stats.

Would it make sense to report usage as CgroupStats.MemoryStats.Usage - CgroupStats.MemoryStats.Cache, to avoid confusing evictable memory with "real" usage? Or is there a technique we're neglecting that could avoid this situation altogether? Thanks!
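
For anyone who wants to sanity-check the numbers on an instance in the meantime, the same subtraction can be done directly against the cgroup files. This is a minimal sketch, assuming the cgroup v1 memory hierarchy is mounted at /cgroup/memory (as on the ECS-optimized AMIs); the container ID is a placeholder:

$ CID=<full-container-id>   # placeholder: substitute the full container ID (e.g. from docker ps --no-trunc)
$ usage=$(cat /cgroup/memory/docker/$CID/memory.usage_in_bytes)
$ cache=$(awk '$1 == "cache" {print $2; exit}' /cgroup/memory/docker/$CID/memory.stat)
$ echo $((usage - cache))   # approximate "real" usage in bytes, with page cache excluded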

@ankanm

ankanm commented Jan 8, 2016

+1

@aaithal
Contributor

aaithal commented Jan 13, 2016

@mikeybtn, thank you for reporting this issue. You are correct in pointing out that the ECS Agent does not subtract the memory cache value from usage while reporting memory stats. We also need to consider whether subtracting the cache value is the proper way of handling this, or whether we should expose a different metric for the ‘Cache’ memory stat (as was the case where docker-archive/libcontainer#518 fixed the behavior reported in docker-archive/libcontainer#506). We will get back to you when we have an update for this issue.

Thanks,
Anirudh

@mikeybtn
Author

@aaithal thanks for acknowledging! I can see how this might extend to a wider product issue, where you'd need to surface both values in AWS charts and metrics. I'd be happy to help however I can.

@tschutte

tschutte commented Feb 9, 2016

Thanks for identifying this issue. We also recently deployed a legacy Java app into ECS, and I was concerned to see the service reporting memory utilization at almost twice our JVM -Xmx setting. My own investigation also indicates that ECS is reporting memory cache as part of overall memory utilization.

My only concern at this point is that a container could be improperly stopped by ECS because its memory usage exceeds the container definition, when that usage is only due to file cache and not our application.

@tschutte

tschutte commented Feb 9, 2016

And now that I've done a little more reading to better understand what is really going on in Docker regarding memory usage, reporting, and limits, I realize my question above is moot: the ECS agent is not responsible for killing a container on OOM; the kernel's memory cgroup (OOM killer) is.
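
For what it's worth, a quick way to confirm after the fact whether the kernel OOM-killed a container (rather than anything the ECS agent did) is to inspect its state once it exits; the container name below is a placeholder:

$ docker inspect -f '{{.State.OOMKilled}}' <container>   # prints "true" if the kernel's OOM killer terminated it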

@marklieberman

+1

I have a Docker service that produces a lot of temporary files. ECS is killing it periodically even though our JVM metrics show memory usage is stable and well below the memory limit.

@aaithal
Contributor

aaithal commented Mar 28, 2016

@marklieberman, thank you for your feedback. Please note that the memory limit is a parameter set via the Docker daemon and is ultimately enforced by the kernel as a memory cgroup limit. This is also documented in the ECS documentation about task definitions.
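
To illustrate with hypothetical values: a memory setting of 256 MiB in the container definition is passed to Docker much like -m 256m on the command line, and ends up as the cgroup's hard limit on the instance (a sketch; the image, container name, and value are chosen only for illustration, and the path matches the ECS-optimized AMIs):

$ docker run -d --name memdemo -m 256m amazonlinux sleep 3600
$ cat /cgroup/memory/docker/$(docker inspect -f '{{.Id}}' memdemo)/memory.limit_in_bytes
268435456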

@hammerdr

Just going to +1 this issue. It's important for us so that we can size our containers to fit appropriately. Specifically, this prevents us from running a low-memory process with low memory limits, because we can't tell the difference between process memory and cache (we could build something that does this for us, but we'll just make do with higher memory limits for now).
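
Until then, the split is at least visible per container in the cgroup's memory.stat file (a sketch; the container ID is a placeholder and /cgroup/memory is the mount point used on the ECS-optimized AMIs):

$ grep -E '^(rss|cache) ' /cgroup/memory/docker/<full-container-id>/memory.stat   # "rss" is process memory, "cache" is page cache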

@aaithal
Contributor

aaithal commented Nov 23, 2016

Hi All,

I wanted to provide an update on this with respect to the changes proposed in #582. Merging it would mean that the ECS Memory Utilization metrics for Cluster and Service would exclude the number of bytes of page cache memory. I also wanted to share the results of the experiments I ran to confirm that this doesn't impact the OOM behavior of containers.

I ran a bash command to consume as much cache as possible by simply reading all
files in the /usr/bin directory. Limiting this container's memory consumption
to 10m ("Test 1") results in this container using up to 10m memory in total.
But, its actual usage when we subtract the page cache turns out to be 0.5m.
This would show up in the console as 5.15625%, instead of 100% if this
were the only container running in the task.

Now, if the container is relaunched with a memory limit of 5m ("Test 2"), it
still consumes 5m in total, but the actual utilization continues to be at
0.5m. This would show up in the console as 10.15625%, instead of 100%
if this were the only container running in the task.

To summarize, merging this PR would result in a drop in the "Memory Utilization" metric, which would then reflect the actual memory usage of containers, but it shouldn't impact the OOM behavior of your containers/tasks. There would also be a discrepancy between what's shown in the CloudWatch console and docker stats, since docker stats continues to include page cache in its usage figure.

I have also verified this behavior across both generations of AMIs that support
ECS Telemetry ("Setup 1" and "Setup 2").

Please let me know if you have any concerns regarding the same.

Test 1: Run container with 10m memory
$ docker run -d --name cache -m 10m 137112412989.dkr.ecr.us-west-2.amazonaws.com/amazonlinux:latest sh -c "while [ true ]; do for i in \`ls /usr/bin\`; do cat /usr/bin/\$i; done; done"
53b6870c4b99fd1659d331ddfe7ab814a75eeb9a4fd21128254b9e1569ed0801
$ docker stats  --no-stream cache
CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
cache               9.81%               10.49 MB / 10.49 MB   100.00%             578 B / 578 B       105.8 MB / 0 B      0
$ cat /cgroup/memory/docker/53b6870c4b99fd1659d331ddfe7ab814a75eeb9a4fd21128254b9e1569ed0801/memory.stat | head -n 1
cache 9895936
$ cat /cgroup/memory/docker/53b6870c4b99fd1659d331ddfe7ab814a75eeb9a4fd21128254b9e1569ed0801/memory.usage_in_bytes
10436608
$ python -c "print (10436608-9895936)*100/(5*1024*1024.0)"
5.15625
Test 2: Run container with 5m memory
$ docker run -d --name cache -m 5m 137112412989.dkr.ecr.us-west-2.amazonaws.com/amazonlinux:latest sh -c "while [ true ]; do for i in \`ls /usr/bin\`; do cat /usr/bin/\$i; done; done"
184f9eec18b56d2a7de38e2077d8371d142db9ef028e8d6472a413587df11073
$ docker stats --no-stream cache
CONTAINER           CPU %               MEM USAGE / LIMIT    MEM %               NET I/O             BLOCK I/O           PIDS
cache               9.31%               5.21 MB / 5.243 MB   99.38%              648 B / 648 B       369.1 MB / 0 B      0
$ cat /cgroup/memory/docker/184f9eec18b56d2a7de38e2077d8371d142db9ef028e8d6472a413587df11073/memory.stat | head -n 1
cache 4653056
$ cat /cgroup/memory/docker/184f9eec18b56d2a7de38e2077d8371d142db9ef028e8d6472a413587df11073/memory.usage_in_bytes
5185536
$ python -c "print (5185536-4653056)*100/(5*1024*1024.0)"
10.15625
Setup 1: The amzn-ami-2016.09.b-amazon-ecs-optimized AMI
  • Docker Version:
$ docker version
Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.3
 Git commit:   b9f10c9/1.11.2
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.3
 Git commit:   b9f10c9/1.11.2
 Built:
 OS/Arch:      linux/amd64
  • Kernel Version:
$ uname -a
Linux ip-172-31-0-170 4.4.30-32.54.amzn1.x86_64 #1 SMP Thu Nov 10 15:52:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • ECS Agent Version:
$ curl --silent localhost:51678/v1/metadata | jq -r '.Version'
Amazon ECS Agent - v1.13.1 (efe53c6)
Setup 2: The amzn-ami-2015.09.a-amazon-ecs-optimized AMI
  • Docker Version:
$ docker version
Client version: 1.7.1
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 786b29d/1.7.1
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d/1.7.1
OS/Arch (server): linux/amd64
  • Kernel Version:
$ uname -a
Linux ip-172-31-12-210 4.1.7-15.23.amzn1.x86_64 #1 SMP Mon Sep 14 23:20:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
  • ECS Agent Version:
$ curl --silent localhost:51678/v1/metadata | jq -r '.Version'
Amazon ECS Agent - v1.5.0 (b197edd)

@ZhbL

ZhbL commented Dec 1, 2016

+1

@aaithal aaithal added this to the 1.14.0 milestone Dec 7, 2016
@samuelkarp
Contributor

Released

@ibrahima

ibrahima commented Feb 2, 2017

If I understand correctly, even after #582 the filesystem cache is still counted toward the limit when deciding whether to OOM-kill a process in a container? Just wanted to understand what was going on and what the expected behavior actually is.

@samuelkarp
Contributor

@ibrahima The file cache should not affect OOM behavior as the kernel should be able to evict from the file cache to make more memory available to the processes within the container.

@ibrahima

ibrahima commented Feb 2, 2017

Whoa, thanks for the fast response! I'm a bit confused, because watching top while my main process runs suggests it's getting killed while plenty of memory is still in use for cache at that point, but it may be due to something else.

I should add that we're currently on a very old AMI, planning to upgrade soon but just haven't had the chance.
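
In case it helps narrow it down, the kernel logs every OOM kill, so checking the instance's kernel log is a quick way to tell whether the OOM killer was involved (a sketch; the exact log wording varies by kernel version):

$ dmesg | grep -i -E 'out of memory|killed process'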

@timchenxiaoyu

I don't think so.
[image attachment]
docker run -m limits all memory usage, including the cache.
