Intermediate "Ingress Reconciliation failed" event received during reconciliation #9727
Comments
Here is the flow of events that led to this error:
The event at the line with the arrow indicates when the
Which Ingress plugin are you using here? This seems like the respective plugin is misbehaving in that it reports
I experienced that on Google Cloud Run (GKE platform); I just got:
Knative serving:
It's a Kourier affair, but that has been happening for quite some time already (sorry for the late reply @markusthoemmes, I just overlooked your question :)
If it's Kourier, is it still an issue at HEAD then? I've recently made quite a few changes to Kourier's status reporting.
Unfortunately, I can't reproduce it anymore, as we introduced a client safeguard for this kind of situation (with knative/client#1052). For our needs this is good enough, so from my side the issue could be closed. I'm not sure whether the other reference to the Google Cloud Run error really applies to us here (it's hard to test anyway), or whether it would be better to hunt that error down from the Cloud Run side (via Google's support channels for Cloud Run).
Are all clients expected to implement an error window like kn's that can be used as a workaround for this issue? In clients like gcloud, for example, I believe False is always considered a terminal failure, so it remains susceptible to this issue.
/area API
/assign @dprotaso
I'm assuming that we consider this a bug. 😁
/triage accepted
I wonder if this is even still a bug 🤔
It might be true that a client needs to be resilient to this flow of events, which can always occur in an eventually-consistent system like the declarative reconciliation platform that Kubernetes is. But if "Reconciliation Failed" is an intermediate and not an error state, then this should be detectable somehow. For any client it would be very helpful to get some more insight into the reconciliation process: Is it a "terminal" error that is likely to last for some time? Is it an "intermediate state" that can just be considered a tag for an ongoing reconciliation? Admittedly, "watching for a state" in K8s is not a perfect fit, but I believe we can get it right 99% of the time if we can distinguish between an "expected flow" and an "error flow" of events. Having intermediate error events which are part of both is not helpful and leads to complex workarounds. @markusthoemmes, what would be your suggestion for how to deal with this situation?
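One heuristic a client could use to make that distinction, sketched below with illustrative, simplified types rather than the actual Knative or kn client API: treat Ready=False as potentially intermediate while the controller has not yet observed the latest spec generation, or while other conditions are still Unknown, and only treat it as terminal once everything has settled. This is just a sketch under those assumptions, not a guaranteed classification.

```go
package readiness

// Illustrative, simplified types; the real Knative status types live in
// knative.dev/pkg/apis and knative.dev/serving, but are not used here.
type Condition struct {
	Type   string // e.g. "Ready", "ConfigurationsReady", "RoutesReady"
	Status string // "True", "False", or "Unknown"
	Reason string
}

type ServiceSnapshot struct {
	Generation         int64 // metadata.generation of the Service
	ObservedGeneration int64 // status.observedGeneration reported by the controller
	Conditions         []Condition
}

// LooksTerminal reports whether a snapshot with Ready=False should be
// treated as a real failure ("error flow") rather than an intermediate
// reconciliation state ("expected flow").
func LooksTerminal(s ServiceSnapshot) bool {
	// The controller has not yet caught up with the latest spec, so a
	// False condition may still describe the previous state of the world.
	if s.ObservedGeneration < s.Generation {
		return false
	}
	// While any condition is still Unknown, reconciliation is in flight
	// and Ready may still flip back to True.
	for _, c := range s.Conditions {
		if c.Status == "Unknown" {
			return false
		}
	}
	// Everything has settled and Ready is still False: treat it as terminal.
	return true
}
```

A watcher could ignore Ready=False snapshots that don't look terminal by this check, or combine it with a time-based error window like the one kn uses.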
Yeah, I was wondering if the intermittent "Ingress reconciliation failed" is still happening. You mentioned above that you can't reproduce it anymore.
Well, we haven't checked it, and we don't suffer from it as we have a client-side fix for it (that "error window"). So from the client POV it's more a "we don't know if this still occurs". We would need to add some more debugging again or remove our fix.
It may be reasonable to change our conformance/e2e tests to ensure we don't encounter such blips, i.e.
This is sorta related to #1178 when it comes to shoring up our tests.
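A rough sketch of what such an e2e check could look like, assuming the test harness already records the sequence of Ready-condition values observed while the update is in progress (the collection mechanism and the helper name are illustrative, not Knative's actual test helpers):

```go
package e2e

import "testing"

// assertNoReadinessBlip fails the test when an intermediate Ready=False
// was observed at any point before the final Ready=True of a successful
// service update.
func assertNoReadinessBlip(t *testing.T, observed []string) {
	t.Helper()
	if len(observed) == 0 || observed[len(observed)-1] != "True" {
		t.Fatalf("update never settled at Ready=True; observed: %v", observed)
	}
	for i, status := range observed[:len(observed)-1] {
		if status == "False" {
			t.Errorf("intermediate Ready=False at event %d before the final Ready=True; observed: %v", i, observed)
		}
	}
}
```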
What version of Knative?
Expected Behavior
No intermediate Ready == False should be received during a service update reconciliation if the reconciliation eventually finishes up with Ready == True.
Actual Behavior
Since 0.18.0 the client CI has a frequent flake (maybe 80% of the time, I would say) when updating a service after a series of operations on this service:
If we look at the service right after this error appeared, it looks like:

so it had already moved past the false status for Ready (but kn considered this intermediate state as an error). You can see the status of the cluster at that time here.
This issue with intermittent false ready states has already been discussed in #6784, but without a solution.
The current safeguard that we have implemented in the client (i.e. an "error window" during which it waits for another state change in case of an error) does not seem to work here. We are investigating this in parallel on the client side (in knative/client#1052).
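For illustration, here is a minimal sketch of such an error window, assuming a hypothetical stream of Ready-condition values delivered on a channel; the function name and the channel-based API are made up for this example and are not the actual kn implementation from knative/client#1052:

```go
package readiness

import (
	"fmt"
	"time"
)

// WaitForReady returns nil once Ready=True is observed. When Ready=False
// arrives it does not fail immediately: it keeps watching for up to
// errorWindow for a newer event, and only reports an error if the False
// state is not superseded within that window.
func WaitForReady(events <-chan string, errorWindow, timeout time.Duration) error {
	deadline := time.After(timeout)
	var windowExpired <-chan time.Time // nil until we enter the error window

	for {
		select {
		case state, ok := <-events:
			if !ok {
				return fmt.Errorf("event stream closed before Ready=True was seen")
			}
			switch state {
			case "True":
				return nil // reconciliation finished successfully
			case "False":
				if windowExpired == nil {
					windowExpired = time.After(errorWindow) // start the grace period
				}
			default: // "Unknown" supersedes an earlier False
				windowExpired = nil
			}
		case <-windowExpired:
			return fmt.Errorf("service reported Ready=False and did not recover within %s", errorWindow)
		case <-deadline:
			return fmt.Errorf("timed out after %s waiting for Ready=True", timeout)
		}
	}
}
```

The size of that window is the obvious trade-off: too small and the blip still surfaces as an error, too large and genuinely failing services take longer to be reported.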
The question here, though, is: why is this "ingress reconcile failed" event thrown at all, and what has changed in serving that it now happens this often?
Steps to Reproduce the Problem
See the steps in https://prow.knative.dev/view/gs/knative-prow/logs/ci-knative-client-auto-release/1313848987291226115#1:build-log.txt%3A3758 that lead to this error