
Intermediate "Ingress Reconciliation failed" event received during reconciliation #9727

Open
rhuss opened this issue Oct 7, 2020 · 13 comments
Labels: area/API (API objects and controllers), kind/bug (Categorizes issue or PR as related to a bug), triage/accepted (Issues which should be fixed post-triage)

rhuss (Contributor) commented Oct 7, 2020

What version of Knative?

0.18

Expected Behavior

No intermediate Ready == False condition should be received during a service update reconciliation if the reconciliation eventually finishes with Ready == True.

Actual Behavior

Since 0.18.0 the client CI has a frequent flake (maybe 80% of runs, I would say) when updating a service after a series of operations on it:

       ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
        🦆 kn service update svc3a -a alpha=direwolf -a brave- --namespace kne2etests23
        ┃ Updating Service 'svc3a' in namespace 'kne2etests23':
        ┃ 
        ┃   0.059s The Configuration is still working to reflect the latest desired specification.
        ┃   0.480s Ingress reconciliation failed
        ┃ 
        🔥 Error: ReconcileIngressFailed: Ingress reconciliation failed
        🔥 Run 'kn --help' for usage
        🔥 

If we look at the service right after this error appeared, it looks like this:

      Conditions:
            Last Transition Time:        2020-10-07T14:50:48Z
            Status:                      Unknown
            Type:                        ConfigurationsReady
            Last Transition Time:        2020-10-07T14:50:48Z
            Status:                      Unknown
            Type:                        Ready
            Last Transition Time:        2020-10-07T14:50:48Z
            Status:                      True
            Type:                        RoutesReady

so the service has already moved past the false status for Ready (but kn treated this intermediate state as an error).
You can see the status of the cluster at that time here

This issue with intermittent false Ready states has already been discussed in #6784, but without a solution.

The current safeguard that we have implemented in the client (i.e. an "error window" during which it waits for another state change after an error) does not seem to work here. We are investigating this in parallel on the client side (knative/client#1052).
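For illustration, the error-window idea can be sketched as follows. This is a minimal, hypothetical model over an already-collected event stream; the real implementation in knative/client#1052 uses different types and a live timer:

```python
def settles_in_error(events, window=2.0):
    """Sketch of an 'error window': a Ready=False condition is treated
    as terminal only if no later state change supersedes it within the
    window. events: list of (seconds_since_start, ready) tuples with
    ready in {"True", "False", "Unknown"}."""
    error_since = None
    for t, ready in events:
        if ready == "False":
            if error_since is None:
                error_since = t      # start the error timer
        else:
            error_since = None       # any later non-error state cancels it
    if error_since is None:
        return False                 # stream ended in a non-error state
    last_t = events[-1][0]
    return last_t - error_since >= window

# The flake above: a transient Ready=False at 0.378s, recovery later.
flaky = [(0.041, "Unknown"), (0.378, "False"),
         (8.445, "Unknown"), (9.638, "True")]
print(settles_in_error(flaky))   # transient error is absorbed -> False
```

With a 2s window, the 0.378s `ReconcileIngressFailed` blip would be absorbed because a non-error state follows well before the window elapses.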

The question here, though, is: why is this "ingress reconciliation failed" event emitted at all, and what has changed in Serving so that it now happens this often?

Steps to Reproduce the Problem

See the steps in https://prow.knative.dev/view/gs/knative-prow/logs/ci-knative-client-auto-release/1313848987291226115#1:build-log.txt%3A3758 that lead to this error

rhuss added the kind/bug label on Oct 7, 2020
rhuss (Contributor, Author) commented Oct 8, 2020

Here is the flow of events that lead to this error:

 ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
        🦆 kn service update svc3 --annotation alpha=direwolf --annotation brave- --namespace kne2etests6
        ┃ Updating Service 'svc3' in namespace 'kne2etests6':
        ┃ 
        ┃ RCV conditions: ConfigurationsReady - OutOfDate - The Configuration is still working to reflect the latest desired specification.
        ┃ RCV conditions: Ready - OutOfDate - The Configuration is still working to reflect the latest desired specification.
        ┃   0.041s The Configuration is still working to reflect the latest desired specification.
        ┃ RCV conditions: RoutesReady -  - 
        ┃ RCV conditions: ConfigurationsReady -  - 
        ┃ RCV conditions: Ready -  - 
        ┃ RCV conditions: RoutesReady -  - 
        ┃ RCV conditions: ConfigurationsReady -  - 
-->  ┃ RCV conditions: Ready - ReconcileIngressFailed - Ingress reconciliation failed
        ┃ Received error - ReconcileIngressFailed : Ingress reconciliation failed (Error Window: 2s, error timer: <nil>)
        ┃   0.378s Ingress reconciliation failed
        ┃ RCV conditions: RoutesReady - ReconcileIngressFailed - Ingress reconciliation failed
        ┃ RCV conditions: ConfigurationsReady -  - 
        ┃ RCV conditions: Ready -  - 
        ┃ RCV conditions: ConfigurationsReady -  - 
        ┃ RCV conditions: Ready - TrafficNotMigrated - Traffic is not yet migrated to the latest revision.
        ┃   8.445s Traffic is not yet migrated to the latest revision.
        ┃ RCV conditions: RoutesReady - TrafficNotMigrated - Traffic is not yet migrated to the latest revision.
        ┃ RCV conditions: ConfigurationsReady -  - 
        ┃ RCV conditions: Ready - IngressNotConfigured - Ingress has not yet been reconciled.
        ┃   8.736s Ingress has not yet been reconciled.
        ┃ RCV conditions: RoutesReady - IngressNotConfigured - Ingress has not yet been reconciled.
        ┃ RCV conditions: ConfigurationsReady -  - 
        ┃ RCV conditions: Ready - Uninitialized - Waiting for load balancer to be ready
        ┃   8.865s Waiting for load balancer to be ready
        ┃ RCV conditions: RoutesReady - Uninitialized - Waiting for load balancer to be ready
        ┃ RCV conditions: ConfigurationsReady -  - 
        ┃ RCV conditions: Ready -  - 
        ┃   9.638s Ready to serve.
        ┃ 
        ┃ Service 'svc3' updated to latest revision 'svc3-ppstg-2' is available at URL:
        ┃ http://svc3.kne2etests6.example.com
        ┃ 

The line marked with the arrow shows when the ReconcileIngressFailed event was received.

markusthoemmes (Contributor) commented:

Which Ingress plugin are you using here? It seems like the respective plugin is misbehaving, in that it reports Ready = False in a non-terminal condition.

sneko commented Dec 7, 2020

I experienced this on Google Cloud Run (GKE platform); I just got:

Ingress reconciliation failed

Knative serving: v0.17.2-gke.5

rhuss (Contributor, Author) commented Dec 7, 2020

Which Ingress plugin are you using here?

Kourier, and that has been the case for quite some time already (sorry for the late reply @markusthoemmes, I just overlooked your question :)

markusthoemmes (Contributor) commented:

If it's Kourier, is it still an issue at HEAD then? I've recently done quite a few changes to Kourier's status reporting.

rhuss (Contributor, Author) commented Dec 8, 2020

Unfortunately, I can't reproduce it anymore, as we introduced a client-side safeguard for this kind of situation (knative/client#1052).

For our needs this is good enough, so from my side the issue could be closed. I'm not sure whether the Google Cloud Run report above really applies to us (it's hard to test anyway), or whether it would be better to hunt that error from the Cloud Run side (via Google's support channels for Cloud Run).

aaron-lerner commented:

Are all clients expected to implement an error window like kn's as a workaround for this issue? In clients like gcloud, for example, I believe False is always treated as a terminal failure, so gcloud remains susceptible to this issue.

evankanderson (Member) commented:

/area API

/assign @dprotaso

I'm assuming that we consider this a bug. 😁

/triage accepted

markusthoemmes (Contributor) commented:

I wonder if this is even still a bug 🤔

rhuss (Contributor, Author) commented Jun 1, 2021

I wonder if this is even still a bug 🤔

It might be true that a client needs to be resilient to this flow of events, which can always occur in an eventually-consistent system like the declarative reconciliation platform that Kubernetes is. But if "Reconciliation Failed" is an intermediate state and not an error state, then this should be detectable somehow.

For any client it would be very helpful to get more insight into the reconciliation process: Is this a "terminal" error that is likely to last for some time? Or is it an "intermediate" state that can simply be taken as a sign of ongoing reconciliation? Admittedly, "watching for a state" is not a perfect fit for Kubernetes, but I believe we can get it right 99% of the time if we can distinguish between an 'expected flow' and an 'error flow' of events. Intermediate error events that appear in both flows are not helpful; they lead to complex workarounds like the one kn has introduced, and every other client would need a similar one to decide whether a declared state change has been reconciled or not.

A timeout could help, but it would unnecessarily delay and block execution in cases where a failure can be detected that won't resolve before the timeout expires (and such timeouts are typically in the range of minutes). Some server-side help would be really useful, even if we only added an 'error assessment' of how likely it is that the current state is terminal and won't be fixed without another update to the resource itself.
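Without such a server-side assessment, about the best a client can do is guess from the condition's reason. A hypothetical sketch of that guesswork (the reason strings in the "transient" set come from the event flow above; the classification itself, and the `RevisionFailed` example reason, are made up for illustration and are exactly what a server-side hint would make unnecessary):

```python
# Hypothetical client-side heuristic: classify a Ready=False condition
# by its reason string. The "transient" reasons below were all observed
# to clear on their own in the event flow above; any other reason is
# pessimistically assumed to be terminal.
TRANSIENT_REASONS = {
    "ReconcileIngressFailed",
    "TrafficNotMigrated",
    "IngressNotConfigured",
    "Uninitialized",
}

def is_probably_terminal(ready_status, reason):
    """Guess whether a received condition represents a lasting failure."""
    return ready_status == "False" and reason not in TRANSIENT_REASONS

print(is_probably_terminal("False", "ReconcileIngressFailed"))  # False
print(is_probably_terminal("False", "RevisionFailed"))          # True
```

The fragility is obvious: the reason set must be maintained per Ingress plugin and Serving version, which is why an explicit terminal/intermediate signal from the server would be preferable.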

@markusthoemmes, what would be your suggestion for how to deal with this situation?

markusthoemmes (Contributor) commented:

Yeah, I was wondering whether the intermittent "Ingress reconciliation failed" is still happening. You mentioned above that you can't reproduce it anymore.

rhuss (Contributor, Author) commented Jun 1, 2021

Well, we haven't checked, and we don't suffer from it anymore since we have a client-side fix (that 'error window'). So from the client's POV it's more a case of "we don't know whether this still occurs".

We would need to add some more debugging again or remove our fix.

dprotaso (Member) commented Jun 12, 2021

It may be reasonable to change our conformance/e2e tests to ensure we don't encounter such blips, i.e.:

  1. Create Service
  2. Wait for Ready
  3. Watch for Status Changes
  4. Perform an Update

This is sorta related to #1178 when it comes to shoring up our tests.
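The invariant such a test would assert could be sketched like this (illustrative only, not actual conformance code): after the update starts, the watched Ready condition may pass through Unknown but must never report False before settling back to True.

```python
def no_false_blips(ready_values):
    """ready_values: Ready-condition statuses observed by the watch,
    in order, e.g. ["Unknown", "Unknown", "True"]. Returns True iff
    the service settled to Ready=True without ever reporting False."""
    settled = False
    for status in ready_values:
        if status == "False":
            return False          # a blip like ReconcileIngressFailed
        if status == "True":
            settled = True
    return settled

# The flow reported in this issue would fail such a check:
print(no_false_blips(["Unknown", "False", "Unknown", "True"]))  # False
print(no_false_blips(["Unknown", "Unknown", "True"]))           # True
```

A real e2e test would feed this check from a watch on the Service's status conditions between steps 3 and 4 above.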
