
feature request: Retry if Registry Returns 5xx HTTP Error #2299

Open
adambkaplan opened this issue Jan 16, 2025 · 7 comments

@adambkaplan

Feature Request

Improve skopeo's "retry" logic to retry if the registry returns a 5xx HTTP error code. Such an error can be intermittent in a production environment, depending on how the container registry is configured/deployed.

Use Case

On Red Hat's Konflux, skopeo is used to build the "source" container image/OCI artifact. This step in the build pipeline can fail if the upstream registry (such as registry.redhat.io) is having intermittent performance issues.

@adambkaplan
Author

From a recent run log, it appears skopeo does not retry in this case:

2025-01-16 09:19:35,256:source-build:DEBUG:copy image: ['skopeo', 'copy', '--retry-times', '5', '--remove-signatures', 'docker://registry.access.redhat.com/ubi9-minimal:9.5-1731593028-source', 'oci:/var/workdir/source-build/parent_image_sources']
Copying blob sha256:dabcb493812b34ce7aea02e0f4ea5735cd7bb9d973389c9fed628d8166ea505f
Copying blob sha256:3298bbb4739d5a3e758c0e80fc69a77573a730c2f8427fcc719dc2dbeffdabc8
Copying blob sha256:b8d6db0209c1f7c3cc256a6ae2065afd105356fb0e49f71a2db4211037236dbe
...
...
Copying blob sha256:dcce78f78298cf49a96b44952bdd6ae67ab69a911253bd331c9320549592accd
time="2025-01-16T09:19:52Z" level=fatal msg="reading blob sha256:50031b8ea1fd30e7eb133ee62037adb5ff984ab77bac81883b9ab9880b488f01: fetching blob: received unexpected HTTP status: 502 Bad Gateway"

@mtrmac
Contributor

mtrmac commented Jan 16, 2025

Thanks for your report.

This logic is centralized in https://github.com/containers/common/tree/main/pkg/retry, moving there.

@dustymabe
Contributor

As noted in containers/podman#25109 (comment), this also means the intermittent 502 Bad Gateway errors we often get from quay.io can keep quadlets from coming up on initial system startup.

@dustymabe
Contributor

I assume retrying 502 would also be really useful for podman push. Hit this this morning:

[2025-01-27T12:20:50.106Z] 2025-01-27 12:20:50,052 INFO - Running command: ['podman', 'push', 'quay.io/coreos-assembler/staging:aarch64-69f5874']
[2025-01-27T12:20:50.361Z] Getting image source signatures
[2025-01-27T12:20:50.361Z] Copying blob sha256:3bb8e7714d4f7920a3723d2261381aa38597e9ae20b33e656ace2427c48276cb
[2025-01-27T12:21:05.814Z] Copying blob sha256:fc3fbed5ce15e1cdddbcec5a85b91091aeb9dd6b16951e764769ab32d3132327
[2025-01-27T12:21:52.408Z] Error: writing blob: uploading layer to https://quay.io/v2/coreos-assembler/staging/blobs/uploads/4e8204f9-867d-421b-b234-9965fe90967f?digest=sha256%3A47d9cf72e71e4f96c2efb4c8ba6cbaf204ea2331b6ef939719a82ddd92768273: received unexpected HTTP status: 502 Bad Gateway

@adambkaplan
Author

I imagine 502 or 503 are the most common server-side HTTP errors that can come from a container registry. Personally I think it's safe to retry on any 5xx code since the default retry logic uses an exponential backoff.
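For illustration, a rough sketch of what retrying any 5xx response with exponential backoff could look like (a generic Go example, not the actual containers/common pkg/retry code):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// getWithRetry retries a GET whenever the server answers with a 5xx status,
// doubling the wait between attempts. Generic illustration only; the real
// logic would live in containers/common/pkg/retry.
func getWithRetry(url string, maxRetries int) (*http.Response, error) {
	delay := time.Second
	for attempt := 0; ; attempt++ {
		resp, err := http.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or a non-retryable client-side status
		}
		if err == nil {
			resp.Body.Close()
			err = fmt.Errorf("received unexpected HTTP status: %s", resp.Status)
		}
		if attempt >= maxRetries {
			return nil, err
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff keeps retries cheap for a struggling registry
	}
}

func main() {
	// URL only for illustration.
	resp, err := getWithRetry("https://registry.access.redhat.com/v2/", 5)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("got:", resp.Status)
}
```

Each failed attempt doubles the wait, so even retrying on every 5xx code puts little extra load on a registry that is already struggling.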

@Luap99
Member

Luap99 commented Jan 28, 2025

I can confirm that 502 from quay.io is one of the most common errors I see in our (podman) CI logs. It was so flaky we had to move to a local cache registry to solve most of these issues. However, some pulls still hit the real quay.io, and in other test environments where we do not have the cache registry (e.g. gating tests on distros) I still see regular 502s in the logs.

So I totally agree here that retrying on 5xx codes should help many of our users.

That said, I am not sure how pkg/retry can be used here without other changes. IsErrorRetryable() checks an error, BUT the HTTP request was successful and the Go standard library does not return a known error type we could match on.
https://pkg.go.dev/net/http#Get

A non-2xx response doesn't cause an error.

c/image defines its own type unexpectedHTTPStatusError, but that one is not exported. So it seems to me that the type must first be exported in c/image; then we could update IsErrorRetryable() to match that specific error type and check whether the status code is in the 5xx range.
cc @mtrmac
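For illustration only, a minimal sketch of how that check could look once such a type is exported; UnexpectedHTTPStatusError and its StatusCode field are hypothetical names here, not the current c/image API:

```go
package retry

import (
	"errors"

	// Hypothetical import: assumes c/image/v5/docker exports such a type.
	"github.com/containers/image/v5/docker"
)

// isRetryableServerError sketches how IsErrorRetryable() could recognize 5xx
// registry responses. docker.UnexpectedHTTPStatusError and its StatusCode
// field are assumptions; the type c/image actually has today is unexported.
func isRetryableServerError(err error) bool {
	// net/http itself returns a nil error for a 502, so this only helps once
	// c/image wraps the unexpected status into an error value we can match.
	var statusErr docker.UnexpectedHTTPStatusError
	if errors.As(err, &statusErr) {
		// Retry server-side failures only; 4xx responses are permanent.
		return statusErr.StatusCode >= 500 && statusErr.StatusCode < 600
	}
	return false
}
```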

@mtrmac
Contributor

mtrmac commented Jan 28, 2025

Sure. I probably wouldn’t want to make a c/image-wide new error type; it’s hard to promise that we can provide that information for various (current and hypothetical future) external dependencies (like a Fulcio/Rekor client). But adding a new c/image/v5/docker error type for this purpose seems entirely reasonable.
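As a rough sketch of what such an exported error type might look like (the name and fields are hypothetical, not an existing c/image API):

```go
// Hypothetical sketch for c/image/v5/docker; the type the library has today
// (unexpectedHTTPStatusError) is unexported, and this exported variant does
// not yet exist.
package docker

import "fmt"

// UnexpectedHTTPStatusError would report a registry response with a status
// code the client did not expect (e.g. 502 Bad Gateway), so callers such as
// containers/common/pkg/retry could decide whether to retry.
type UnexpectedHTTPStatusError struct {
	StatusCode int    // numeric HTTP status, e.g. 502
	Status     string // full status line, e.g. "502 Bad Gateway"
}

func (e UnexpectedHTTPStatusError) Error() string {
	return fmt.Sprintf("received unexpected HTTP status: %s", e.Status)
}
```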
