Is your proposal related to a problem?
My team runs the community edition of the Hasura graphql-engine on Kubernetes in a horizontally replicated configuration. We use rolling updates to deploy, use the cli-migrations (v2) docker image as our base image in production (as described here), and use the /healthz?strict=true endpoint as our container's readiness/liveness check. This allows us to deploy Hasura in prod without exposing the console or metadata API, while still allowing for git-driven continuous delivery of metadata changes by version-controlling our metadata files and packaging them into new docker images that get deployed to our Kube cluster on each commit to main. This works great most of the time, allowing us to roll out changes to our GraphQL API just by merging metadata changes to main.
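For concreteness, here's roughly the shape of the Deployment we run - treat this as a sketch only, since the image name, replica count, port, and probe settings below are placeholders rather than our real values:

```yaml
# Sketch of the relevant Deployment config; image, port, and replica count are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hasura
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: hasura
  template:
    metadata:
      labels:
        app: hasura
    spec:
      containers:
        - name: hasura
          # Built FROM the cli-migrations (v2) image with our metadata directory baked in.
          image: registry.example.com/hasura-with-metadata:<git-sha>
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz?strict=true
              port: 8080
          livenessProbe:
            httpGet:
              path: /healthz?strict=true
              port: 8080
```

The key piece is that both probes hit /healthz?strict=true, so a pod running with inconsistent metadata should never be considered ready.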
However, this configuration has one major drawback. The cli-migrations entrypoint script (ref) runs hasura-cli metadata apply without the --disallow-inconsistent-metadata flag. This means that if we accidentally merge inconsistent metadata to main (e.g., by applying metadata changes before a production DB migration has completed), the new pods for our updated Kubernetes deployment will apply that metadata and persist it to the database regardless of any inconsistencies. This metadata update in turn gets picked up by the existing/old pods, which listen for changes to the metadata DB, and "leaks" into them even though they (in theory) were built from an older, functioning revision of our code.
In effect, this defeats the purpose of using a deployment tool like Kubernetes, because what should have been a failed deploy ends up taking down the existing pods, rather than simply failing to spin up new ones. This breaks from the standard mental model of rolling updates, in which a pre-existing stable deployment should stay available until new pods are ready to receive traffic. We hit exactly this scenario earlier this week, leading to a full API outage instead of what we'd expect for any other API that can't become ready - a failed rollout of the new pods without any impact on the old ones.
Describe the solution you'd like
I'd love to have some way to configure the cli-migrations image to disallow the application of inconsistent metadata, the same way that's already possible via the CLI's hasura metadata apply --disallow-inconsistent-metadata command. I'm open to whether that should be a global Hasura configuration (i.e., a setting on the server itself) or just a flag/env-var that can be passed to the cli-migrations image, similar to the HASURA_GRAPHQL_METADATA_DIR variable. I expect the latter would be simpler to implement.
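To make the env-var option concrete, something along these lines is what I'm picturing. Note that the second variable below is purely hypothetical - a strawman name for the proposed setting, not anything that exists today - while HASURA_GRAPHQL_METADATA_DIR is the existing variable mentioned above:

```yaml
# Pod-spec sketch. HASURA_GRAPHQL_METADATA_DIR already exists; the second variable
# is hypothetical - a strawman name for the proposed setting, not a real one.
containers:
  - name: hasura
    image: registry.example.com/hasura-with-metadata:<git-sha>   # placeholder image
    env:
      - name: HASURA_GRAPHQL_METADATA_DIR
        value: /hasura-metadata   # wherever the metadata is baked into the image
      - name: HASURA_GRAPHQL_DISALLOW_INCONSISTENT_METADATA      # hypothetical / proposed
        value: "true"
```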
Doing this would allow us to have a fully safe CI/CD setup in Kubernetes by preventing pods which contain inconsistent metadata from poisoning the shared metadata DB and taking down the entire API.
Just noting - I would be very happy to work on this feature myself, in particular if we'd prefer to scope it to a change to the cli-migrations image only. This would be really helpful for my team, and I'd love a chance to help contribute back to Hasura given how much use we've gotten out of the graphql engine over the last year or two!
Describe alternatives you've considered
I see a few different ways of working around this; I'll give my POV on each inline:
Stop using the cli-migrations image, and use the CLI to apply metadata in production directly. I'd really prefer not to do this for my team, because it would mean exposing the metadata API in prod, which would then make it accessible to anyone who has the admin key. This would be a pretty major step up in the risk/impact of leaking that key, which I'd rather not take on.
Don't enforce "strict=true" on the /healthz endpoint for our readiness and liveness checks, allow inconsistent metadata to be applied to prod, and monitor for it some other way. I'd also prefer to avoid this - given it's possible to determine prior to deployment whether a batch of metadata files will be inconsistent (and thus lead to some amount of API degradation), I see no good reason our prod service should declare itself ready for traffic in those scenarios.
Improve our own testing/quality gates prior to merging to reduce the odds of inconsistent metadata making it into main in the first place (roughly the kind of check sketched below). Of course I'd love to do this as well, but no testing is going to cover 100% of possible mistakes, and the change I'm recommending seems (to me) like a surefire way to prevent broken deploys anyway.
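For reference, the pre-merge gate I have in mind for that last option is roughly the step below. The CI syntax is GitHub-Actions-flavoured purely as an example, and it assumes a staging Hasura instance reachable from CI whose schema closely mirrors prod - which is exactly the assumption that can break down:

```yaml
# Sketch of a pre-merge CI step (GitHub-Actions-flavoured; adapt to your CI system).
# Assumes a staging Hasura instance reachable from CI that mirrors the prod schema.
- name: Check metadata consistency against staging
  run: |
    hasura metadata apply \
      --disallow-inconsistent-metadata \
      --endpoint "$HASURA_STAGING_ENDPOINT" \
      --admin-secret "$HASURA_STAGING_ADMIN_SECRET"
  env:
    HASURA_STAGING_ENDPOINT: ${{ secrets.HASURA_STAGING_ENDPOINT }}
    HASURA_STAGING_ADMIN_SECRET: ${{ secrets.HASURA_STAGING_ADMIN_SECRET }}
```

Even with a gate like that in place, the race described above (metadata merged before the corresponding prod migration finishes) can still slip through, which is why I'd still want the image-level option.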
I ended up putting together an approach for this in PR #10602 - not sure whether that's the approach the maintainers would want to take, or whether this is a feature you all want to pick up at all. If not, no worries; I just wanted to take a stab at it because I was curious whether I could get it working. Let me know either way!