Docker build cache on some machines is getting large (quickly) and exhausting space #3007

Closed
sxa opened this issue Mar 28, 2023 · 7 comments · Fixed by AdoptOpenJDK/openjdk-docker#630

Comments

@sxa
Member

sxa commented Mar 28, 2023

Recent issues:

Between the last two issues there were about 2 weeks in which the cache on the machine grew to 86GB after being cleared out. We should understand what is making it increase so much, whether that is expected or a problem, and the most appropriate way to mitigate it. Caches are generally helpful, but would we benefit from a regular docker builder prune with a size limit on it, e.g. docker builder prune -f --keep-storage 24000000000 to cap the cache at 24GB?
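
For reference, here is a minimal sketch of how a size-limited prune could be wrapped. The helper names gb_to_bytes and prune_build_cache_cmd are hypothetical (they are not taken from any Adoptium script), and the sketch assumes decimal gigabytes, matching the 24000000000 example above. It only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): convert decimal gigabytes to
# the byte count passed to --keep-storage. Assumes 1GB = 1000000000 bytes
# and a shell with 64-bit arithmetic.
gb_to_bytes() {
    echo "$(( $1 * 1000000000 ))"
}

# Hypothetical helper: print the prune command for a limit given in GB.
# Printing instead of executing keeps the sketch safe to try anywhere.
prune_build_cache_cmd() {
    echo "docker builder prune -f --keep-storage $(gb_to_bytes "$1")"
}
```

Running prune_build_cache_cmd 24 prints the command quoted above; piping the result to sh on a docker host would actually perform the prune.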

@smlambert your input would be appreciated as this is on test systems. Has something happened recently that has dramatically increased the amount of cache space that docker would be using (perhaps the dev suites)? We've so far only run into the issue on x64 and s390x, but those may well just be the first ones to hit it.

@sxa sxa added this to the 2023-04 (April) milestone Mar 28, 2023
@smlambert
Contributor

I do not think this is a new problem per se, as there is a history of issues where we hit this. Some of that history is summarized in #2510.

In terms of new dev level tests, none are currently being run regularly.

dev.openjdk includes the openjdk container tests, though those should not be dispatched to static docker hosts (they require the sw.tool.docker label, which should not be on any of the static docker hosts). In any case, we disabled those in January, presumably temporarily for the release, since the release test list and weekly test list are currently entwined (they use the same config file, which needs to be changed so that we can run dev level testing on weekends without it also being run during releases).

dev.system tests, which include the jcstress tests, are also not currently running regularly, though I would like to enable them for weekend runs.

@sxa
Member Author

sxa commented Mar 29, 2023

Most of the previous issues came down to the size of certain containers, so this warrants a new issue rather than reusing the old one. Based on your comment, is it just the external tests that are currently creating new docker containers from Dockerfiles, and so putting things into the cache, from a test perspective?

We also have https://ci.adoptium.net/job/openjdk_build_docker_multiarch/ running on a regular basis to keep the "old" images secure. It may be that re-enabling that (run 870 towards the end of January) has contributed to an increase in cache space.

@sxa
Member Author

sxa commented Apr 3, 2023

I've run docker builder prune -f --keep-storage 10000000000 to reduce the build cache to 10GB (shows as 9.834GB) on all dockerhost systems. I'll check again in a week to see how much space each one is using. Here's a snapshot as of today:

[sxa@fedora ~]$ for A in ampere1 ampere2 xa xi 140.211.168.214 20.61.136.212 148.100.74.237; do echo $A; ssh root@$A docker system df; done 
ampere1
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          32        15        15.9GB    15.81GB (99%)
Containers      19        4         22.22GB   19.12GB (86%)
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B
ampere2
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          112       10        81.94GB   81.13GB (99%)
Containers      13        13        39.57GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     180       0         9.834GB   9.834GB
xa
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          71        9         47.32GB   47.32GB (100%)
Containers      10        10        30.03GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     74        0         0B        0B
xi
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          61        9         44.28GB   44.28GB (100%)
Containers      9         9         21.06GB   0B (0%)
Local Volumes   8         0         1.542GB   1.542GB (100%)
Build Cache     227       0         9.788GB   9.788GB
140.211.168.214
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          27        4         16.64GB   13.51GB (81%)
Containers      4         3         1.811GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     115       0         7.138GB   7.138GB
20.61.136.212
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          14        2         12.58GB   10.21GB (81%)
Containers      6         6         4.954GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     17        0         461.5MB   461.5MB
148.100.74.237
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          12        9         4.141GB   500.8MB (12%)
Containers      12        3         2.688GB   361.5MB (13%)
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B
[sxa@fedora ~]$ 
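
When comparing snapshots like this across hosts, only the Build Cache row really matters. A minimal sketch of pulling that one value out of the docker system df output follows. The helper name build_cache_size is hypothetical, and it assumes the column layout shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): read `docker system df`
# output on stdin and print the SIZE column of the "Build Cache" row.
# "Build Cache" splits into two whitespace-separated fields, so the
# columns are: $1=Build $2=Cache $3=TOTAL $4=ACTIVE $5=SIZE $6=RECLAIMABLE.
build_cache_size() {
    awk '$1 == "Build" && $2 == "Cache" { print $5 }'
}

# Example use on a docker host (not executed here):
#   ssh root@ampere2 docker system df | build_cache_size
```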

@sxa
Member Author

sxa commented Apr 6, 2023

Output from today. A number of the machines have had a significant increase in the build cache size:

ampere1
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          32        15        15.9GB    15.81GB (99%)
Containers      19        4         22.24GB   19.12GB (85%)
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B
ampere2
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          114       10        86.25GB   85.44GB (99%)
Containers      13        13        39.73GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1276      0         196.7GB   196.7GB
xa
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          72        9         38.04GB   38.04GB (100%)
Containers      10        10        14.62GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     2106      7         215.9GB   215.9GB
xi
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          71        13        51.61GB   49.36GB (95%)
Containers      13        9         7.353GB   4.8kB (0%)
Local Volumes   8         4         9.129GB   9.103GB (99%)
Build Cache     2123      0         228.6GB   228.6GB
140.211.168.214
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          25        3         23.52GB   20.61GB (87%)
Containers      3         3         2.072GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     115       0         7.138GB   7.138GB
20.61.136.212
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          13        2         4.711GB   2.345GB (49%)
Containers      6         6         4.728GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     17        0         461.5MB   461.5MB
148.100.74.237
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          12        9         4.141GB   500.8MB (12%)
Containers      12        3         2.462GB   361.5MB (14%)
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B

@sxa
Member Author

sxa commented Apr 6, 2023

I'm running a check at 5 minute intervals on one of the machines that's currently running the docker_build_multiarch job (ampere2 from the above list).

Thu 06 Apr 2023 05:20:11 PM UTC
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          126       10        87.71GB   86.9GB (99%)
Containers      13        13        39.73GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1271      0         193.3GB   193.3GB
Thu 06 Apr 2023 05:25:15 PM UTC
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          110       10        81.94GB   81.13GB (99%)
Containers      13        13        39.73GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1271      0         198.9GB   198.9GB
Thu 06 Apr 2023 05:30:19 PM UTC
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          113       10        85.21GB   84.4GB (99%)
Containers      13        13        39.73GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1276      7         197.5GB   196.7GB
Thu 06 Apr 2023 05:35:23 PM UTC
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          125       10        90.28GB   89.47GB (99%)
Containers      13        13        39.73GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1291      7         194.6GB   194.6GB
Thu 06 Apr 2023 05:40:27 PM UTC
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          138       10        97.08GB   96.27GB (99%)
Containers      13        13        39.73GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1306      0         191.4GB   191.4GB
Thu 06 Apr 2023 05:45:32 PM UTC
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          142       10        98.54GB   97.73GB (99%)
Containers      13        13        39.73GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1311      0         190.7GB   190.7GB
Thu 06 Apr 2023 05:50:36 PM UTC
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          142       10        98.54GB   97.73GB (99%)
Containers      13        13        39.73GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1311      0         190.7GB   190.7GB
Thu 06 Apr 2023 05:55:40 PM UTC
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          142       10        98.54GB   97.73GB (99%)
Containers      13        13        39.73GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1311      0         190.7GB   190.7GB
Thu 06 Apr 2023 06:00:44 PM UTC
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          117       10        84.71GB   83.9GB (99%)
Containers      13        13        39.73GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1318      4         205.7GB   205.7GB
Thu 06 Apr 2023 06:05:48 PM UTC
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          126       10        87.39GB   86.58GB (99%)
Containers      13        13        39.73GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1327      0         204.5GB   204.5GB
Thu 06 Apr 2023 06:10:52 PM UTC
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          110       10        81.94GB   81.13GB (99%)
Containers      13        13        39.73GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1327      0         209.8GB   209.8GB

So it's chewing up close to 20GB extra on each run of that job on that machine.
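
A periodic check like the one above can be done with a small loop. This is only a sketch of one way to do it, not necessarily the exact command that was used. The helper name sample_every is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): run a command COUNT times,
# printing a UTC timestamp before each run and sleeping INTERVAL seconds
# between runs.
# Usage: sample_every INTERVAL_SECONDS COUNT COMMAND [ARGS...]
sample_every() {
    interval=$1
    count=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        date -u
        "$@"
        i=$((i + 1))
        # No need to sleep after the final sample.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example use on a docker host (not executed here): 11 samples, 5 minutes apart.
#   sample_every 300 11 docker system df
```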

I've also enabled ampere1 for these jobs and build jobs, and taken ampere2 offline for now, to evaluate whether ampere1 exhibits the same behaviour (it's Ubuntu 22.04 instead of 20.04, so it will have a different docker version). As a point of note, running a new build job (which pulls down our docker build image) uses no additional Build Cache space.

@sxa
Member Author

sxa commented Apr 11, 2023

Today's output (compare with five days ago):

ampere1
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          45        3         37.25GB   37.25GB (100%)
Containers      4         4         3.14GB    0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     939       0         159.3GB   159.3GB
ampere2
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          110       10        81.94GB   81.13GB (99%)
Containers      13        13        39.9GB    0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     1327      0         209.8GB   209.8GB
xa
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          71        9         47.55GB   47.55GB (100%)
Containers      10        10        14.49GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     4498      0         487GB     487GB
xi
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          59        9         34.13GB   34.13GB (100%)
Containers      9         9         7.378GB   0B (0%)
Local Volumes   8         0         9.129GB   9.129GB (100%)
Build Cache     3895      0         436.8GB   436.8GB
140.211.168.214
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          26        4         16.04GB   13.12GB (81%)
Containers      4         4         2.5GB     0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     115       0         7.138GB   7.138GB
20.61.136.212
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          14        2         12.58GB   10.21GB (81%)
Containers      6         6         4.73GB    0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     17        0         461.5MB   461.5MB
148.100.74.237
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          12        9         4.141GB   500.8MB (12%)
Containers      12        3         2.463GB   361.5MB (14%)
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B

TL;DR: 200GB or more of extra build cache has been consumed on some of the machines.
I've switched the repository used by https://ci.adoptium.net/job/openjdk_build_docker_multiarch/ from AdoptOpenJDK/openjdk-docker to the purge-cache branch of sxa/openjdk-docker. If that keeps things happy for the next few days, we can merge it into the AdoptOpenJDK repo.
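
To put a number on the growth between two snapshots, the sizes printed by docker system df can be converted to bytes and subtracted. A minimal sketch follows, assuming the sizes use decimal (SI) units as the kB/MB/GB suffixes in the output suggest. The helper names size_to_bytes and growth_gb are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): convert a size as printed by
# `docker system df` (e.g. 0B, 4.8kB, 461.5MB, 9.834GB) into bytes.
# Assumes decimal units: 1kB = 1000 bytes, 1GB = 1000000000 bytes.
size_to_bytes() {
    echo "$1" | awk '{
        n = $0 + 0                      # leading numeric part
        u = $0
        sub(/^[0-9.]+/, "", u)          # what remains is the unit suffix
        m = 1
        if (u == "kB") m = 1e3
        else if (u == "MB") m = 1e6
        else if (u == "GB") m = 1e9
        else if (u == "TB") m = 1e12
        printf "%.0f\n", n * m
    }'
}

# Hypothetical helper: growth from BEFORE to AFTER, in GB with one decimal.
# Usage: growth_gb BEFORE AFTER
growth_gb() {
    before=$(size_to_bytes "$1")
    after=$(size_to_bytes "$2")
    awk -v a="$after" -v b="$before" 'BEGIN { printf "%.1f\n", (a - b) / 1e9 }'
}
```

For example, growth_gb 228.6GB 436.8GB compares the Build Cache size on xi from the Apr 6 snapshot with the one above.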

@sxa
Member Author

sxa commented Apr 25, 2023

It's now behaving, other than on arm32, where for some odd reason a value larger than something just over 2.1GB produces an error, e.g. Error response from daemon: keep-storage is in bytes and expects an integer, got 2200000000: strconv.Atoi: parsing "2200000000": value out of range

Fixed with an armv7l-specific check in the code (I could have just skipped it, since the arm32 machines aren't having a problem with this). The output on the other machines looks good:

[sxa@fedora ~]$ for A in ampere1 ampere2 xa xi 140.211.168.214 20.61.136.212 148.100.74.237; do echo $A; ssh root@$A docker system df; done
ampere1
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          43        3         28.34GB   28.34GB (100%)
Containers      4         4         2.957GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     131       0         9.672GB   9.672GB
ampere2
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          111       10        82.91GB   82.1GB (99%)
Containers      13        13        39.99GB   0B (0%)
Local Volumes   13        0         2.914GB   2.914GB (100%)
Build Cache     210       4         9.455GB   9.455GB
xa
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          70        9         35.54GB   35.54GB (100%)
Containers      11        11        6.581GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     220       5         10.09GB   9.868GB
xi
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          61        10        44.48GB   44.33GB (99%)
Containers      10        9         2.57GB    385.1MB (14%)
Local Volumes   8         0         9.129GB   9.129GB (100%)
Build Cache     258       0         9.712GB   9.712GB
140.211.168.214
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          28        4         17.24GB   14.32GB (83%)
Containers      4         4         10.98GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     115       0         7.138GB   7.138GB
20.61.136.212
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          14        3         4.711GB   2.345GB (49%)
Containers      7         7         5.134GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     17        0         461.5MB   461.5MB
148.100.74.237
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          13        3         3.862GB   1.137GB (29%)
Containers      3         3         2.105GB   0B (0%)
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B
[sxa@fedora ~]$ 
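
On the arm32 error: the message points at Go's strconv.Atoi, which parses into a platform-sized int. On a 32-bit platform that tops out at 2147483647, which lines up with the "just over 2.1GB" threshold. Below is a sketch of what an armv7l-specific guard could look like. This is not the actual code from the purge-cache branch, the helper name keep_storage_for is hypothetical, and it assumes the shell's test builtin handles 64-bit integers.

```shell
#!/bin/sh
# Largest value a signed 32-bit integer can hold (2^31 - 1).
INT32_MAX=2147483647

# Hypothetical helper (not the code from the fix): clamp a requested
# --keep-storage byte count on armv7l so it stays within int32 range.
# Usage: keep_storage_for ARCH BYTES   (ARCH as reported by `uname -m`)
keep_storage_for() {
    arch=$1
    bytes=$2
    if [ "$arch" = "armv7l" ] && [ "$bytes" -gt "$INT32_MAX" ]; then
        echo "$INT32_MAX"
    else
        echo "$bytes"
    fi
}

# Example use on a docker host (not executed here):
#   docker builder prune -f --keep-storage "$(keep_storage_for "$(uname -m)" 10000000000)"
```

Clamping rather than skipping the prune means the arm32 machines still get a limit, just a smaller one of roughly 2.1GB.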

Status: Done

3 participants