
Intermediate "Ingress Reconciliation failed" event received during reconciliation #9727

Open
rhuss opened this issue Oct 7, 2020 · 13 comments
Labels: area/API (API objects and controllers), kind/bug (Categorizes issue or PR as related to a bug), triage/accepted (Issues which should be fixed post-triage)

rhuss (Contributor) commented Oct 7, 2020

What version of Knative?

0.18

Expected Behavior

No intermediate Ready == False condition should be received during a service update reconciliation if the reconciliation eventually finishes with Ready == True.

Actual Behavior

Since 0.18.0 the client CI has a frequent flake (maybe 80% of runs, I would say) when updating a service after a series of operations on it:

       ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
        🦆 kn service update svc3a -a alpha=direwolf -a brave- --namespace kne2etests23
        ┃ Updating Service 'svc3a' in namespace 'kne2etests23':
        ┃ 
        ┃   0.059s The Configuration is still working to reflect the latest desired specification.
        ┃   0.480s Ingress reconciliation failed
        ┃ 
        🔥 Error: ReconcileIngressFailed: Ingress reconciliation failed
        🔥 Run 'kn --help' for usage
        🔥 

If we look at the service right after this error appeared, it looks like this:

      Conditions:
            Last Transition Time:        2020-10-07T14:50:48Z
            Status:                      Unknown
            Type:                        ConfigurationsReady
            Last Transition Time:        2020-10-07T14:50:48Z
            Status:                      Unknown
            Type:                        Ready
            Last Transition Time:        2020-10-07T14:50:48Z
            Status:                      True
            Type:                        RoutesReady

so the service has already moved past the false status for Ready (but kn treated this intermediate state as an error).
You can see the status of the cluster at that time here

This issue with intermittent false Ready states has already been discussed in #6784, but without a solution.

The current safeguard that we have implemented in the client (i.e. an "error window" during which it waits for another state change after an error) does not seem to work here. We are investigating this in parallel on the client side (knative/client#1052).
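For illustration, the error-window idea can be sketched as follows. This is a minimal, hypothetical model over an already-collected event stream; the real implementation in knative/client#1052 uses different types and a live timer:

```python
def settles_in_error(events, window=2.0):
    """Sketch of an 'error window': a Ready=False condition is treated
    as terminal only if no later state change supersedes it within the
    window. events: list of (seconds_since_start, ready) tuples with
    ready in {"True", "False", "Unknown"}."""
    error_since = None
    for t, ready in events:
        if ready == "False":
            if error_since is None:
                error_since = t      # start the error timer
        else:
            error_since = None       # any later non-error state cancels it
    if error_since is None:
        return False                 # stream ended in a non-error state
    last_t = events[-1][0]
    return last_t - error_since >= window

# The flake above: a transient Ready=False at 0.378s, recovery later.
flaky = [(0.041, "Unknown"), (0.378, "False"),
         (8.445, "Unknown"), (9.638, "True")]
print(settles_in_error(flaky))   # transient error is absorbed -> False
```

With a 2s window, the 0.378s `ReconcileIngressFailed` blip would be absorbed because a non-error state follows well before the window elapses.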

The question here, though, is: why is this "ingress reconciliation failed" event emitted at all, and what has changed in Serving so that it now happens this often?

Steps to Reproduce the Problem

See the steps in https://prow.knative.dev/view/gs/knative-prow/logs/ci-knative-client-auto-release/1313848987291226115#1:build-log.txt%3A3758 that lead to this error

rhuss added the kind/bug label on Oct 7, 2020
rhuss (Contributor, Author) commented Oct 8, 2020

Here is the flow of events that lead to this error:

 ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
        🦆 kn service update svc3 --annotation alpha=direwolf --annotation brave- --namespace kne2etests6
        ┃ Updating Service 'svc3' in namespace 'kne2etests6':
        ┃ 
        ┃ RCV conditions: ConfigurationsReady - OutOfDate - The Configuration is still working to reflect the latest desired specification.
        ┃ RCV conditions: Ready - OutOfDate - The Configuration is still working to reflect the latest desired specification.
        ┃   0.041s The Configuration is still working to reflect the latest desired specification.
        ┃ RCV conditions: RoutesReady -  - 
        ┃ RCV conditions: ConfigurationsReady -  - 
        ┃ RCV conditions: Ready -  - 
        ┃ RCV conditions: RoutesReady -  - 
        ┃ RCV conditions: ConfigurationsReady -  - 
-->  ┃ RCV conditions: Ready - ReconcileIngressFailed - Ingress reconciliation failed
        ┃ Received error - ReconcileIngressFailed : Ingress reconciliation failed (Error Window: 2s, error timer: <nil>)
        ┃   0.378s Ingress reconciliation failed
        ┃ RCV conditions: RoutesReady - ReconcileIngressFailed - Ingress reconciliation failed
        ┃ RCV conditions: ConfigurationsReady -  - 
        ┃ RCV conditions: Ready -  - 
        ┃ RCV conditions: ConfigurationsReady -  - 
        ┃ RCV conditions: Ready - TrafficNotMigrated - Traffic is not yet migrated to the latest revision.
        ┃   8.445s Traffic is not yet migrated to the latest revision.
        ┃ RCV conditions: RoutesReady - TrafficNotMigrated - Traffic is not yet migrated to the latest revision.
        ┃ RCV conditions: ConfigurationsReady -  - 
        ┃ RCV conditions: Ready - IngressNotConfigured - Ingress has not yet been reconciled.
        ┃   8.736s Ingress has not yet been reconciled.
        ┃ RCV conditions: RoutesReady - IngressNotConfigured - Ingress has not yet been reconciled.
        ┃ RCV conditions: ConfigurationsReady -  - 
        ┃ RCV conditions: Ready - Uninitialized - Waiting for load balancer to be ready
        ┃   8.865s Waiting for load balancer to be ready
        ┃ RCV conditions: RoutesReady - Uninitialized - Waiting for load balancer to be ready
        ┃ RCV conditions: ConfigurationsReady -  - 
        ┃ RCV conditions: Ready -  - 
        ┃   9.638s Ready to serve.
        ┃ 
        ┃ Service 'svc3' updated to latest revision 'svc3-ppstg-2' is available at URL:
        ┃ http://svc3.kne2etests6.example.com
        ┃ 

The line marked with the arrow shows when the ReconcileIngressFailed event was received.

markusthoemmes (Contributor) commented:

Which Ingress plugin are you using here? It seems like the respective plugin is misbehaving, in that it reports Ready = False in a non-terminal condition.

sneko commented Dec 7, 2020

I experienced this on Google Cloud Run (GKE platform); I just got:

Ingress reconciliation failed

Knative serving: v0.17.2-gke.5

rhuss (Contributor, Author) commented Dec 7, 2020

Which Ingress plugin are you using here?

Kourier, and that has been the case for quite some time already (sorry for the late reply @markusthoemmes, I just overlooked your question :)

markusthoemmes (Contributor) commented:

If it's Kourier, is it still an issue at HEAD then? I've recently done quite a few changes to Kourier's status reporting.

rhuss (Contributor, Author) commented Dec 8, 2020

Unfortunately, I can't reproduce it anymore, as we introduced a client-side safeguard for this kind of situation (knative/client#1052).

For our needs this is good enough, so from my side the issue could be closed. I'm not sure whether the Google Cloud Run report above really applies to us (it's hard to test anyway), or whether it would be better to hunt that error from the Cloud Run side (via Google's support channels for Cloud Run).

aaron-lerner commented:

Are all clients expected to implement an error window like kn's as a workaround for this issue? In clients like gcloud, for example, I believe False is always treated as a terminal failure, so gcloud remains susceptible to this issue.

evankanderson (Member) commented:

/area API

/assign @dprotaso

I'm assuming that we consider this a bug. 😁

/triage accepted

markusthoemmes (Contributor) commented:

I wonder if this is even still a bug 🤔

rhuss (Contributor, Author) commented Jun 1, 2021

I wonder if this is even still a bug 🤔

It might be true that a client needs to be resilient to this flow of events, which can always occur in an eventually-consistent system like the declarative reconciliation platform that Kubernetes is. But if "Reconciliation Failed" is an intermediate state and not an error state, then this should be detectable somehow.

For any client it would be very helpful to get more insight into the reconciliation process: Is this a "terminal" error that is likely to last for some time? Or is it an "intermediate" state that can simply be taken as a sign of ongoing reconciliation? Admittedly, "watching for a state" is not a perfect fit for Kubernetes, but I believe we can get it right 99% of the time if we can distinguish between an 'expected flow' and an 'error flow' of events. Intermediate error events that appear in both flows are not helpful; they lead to complex workarounds like the one kn has introduced, and every other client would need a similar one to decide whether a declared state change has been reconciled or not.

A timeout could help, but it would unnecessarily delay and block execution in cases where a failure can be detected that won't resolve before the timeout expires (and such timeouts are typically in the range of minutes). Some server-side help would be really useful, even if we only added an 'error assessment' of how likely it is that the current state is terminal and won't be fixed without another update to the resource itself.
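Without such a server-side assessment, about the best a client can do is guess from the condition's reason. A hypothetical sketch of that guesswork (the reason strings in the "transient" set come from the event flow above; the classification itself, and the `RevisionFailed` example reason, are made up for illustration and are exactly what a server-side hint would make unnecessary):

```python
# Hypothetical client-side heuristic: classify a Ready=False condition
# by its reason string. The "transient" reasons below were all observed
# to clear on their own in the event flow above; any other reason is
# pessimistically assumed to be terminal.
TRANSIENT_REASONS = {
    "ReconcileIngressFailed",
    "TrafficNotMigrated",
    "IngressNotConfigured",
    "Uninitialized",
}

def is_probably_terminal(ready_status, reason):
    """Guess whether a received condition represents a lasting failure."""
    return ready_status == "False" and reason not in TRANSIENT_REASONS

print(is_probably_terminal("False", "ReconcileIngressFailed"))  # False
print(is_probably_terminal("False", "RevisionFailed"))          # True
```

The fragility is obvious: the reason set must be maintained per Ingress plugin and Serving version, which is why an explicit terminal/intermediate signal from the server would be preferable.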

@markusthoemmes, what would be your suggestion for how to deal with this situation?

markusthoemmes (Contributor) commented:

Yeah, I was wondering whether the intermittent "Ingress reconciliation failed" is still happening. You mentioned above that you can't reproduce it anymore.

rhuss (Contributor, Author) commented Jun 1, 2021

Well, we haven't checked, and we don't suffer from it anymore since we have a client-side fix (that 'error window'). So from the client's POV it's more a case of "we don't know whether this still occurs".

We would need to add some more debugging again or remove our fix.

dprotaso (Member) commented Jun 12, 2021

It may be reasonable to change our conformance/e2e tests to ensure we don't encounter such blips, i.e.:

  1. Create Service
  2. Wait for Ready
  3. Watch for Status Changes
  4. Perform an Update

This is sorta related to #1178 when it comes to shoring up our tests.
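The invariant such a test would assert could be sketched like this (illustrative only, not actual conformance code): after the update starts, the watched Ready condition may pass through Unknown but must never report False before settling back to True.

```python
def no_false_blips(ready_values):
    """ready_values: Ready-condition statuses observed by the watch,
    in order, e.g. ["Unknown", "Unknown", "True"]. Returns True iff
    the service settled to Ready=True without ever reporting False."""
    settled = False
    for status in ready_values:
        if status == "False":
            return False          # a blip like ReconcileIngressFailed
        if status == "True":
            settled = True
    return settled

# The flow reported in this issue would fail such a check:
print(no_false_blips(["Unknown", "False", "Unknown", "True"]))  # False
print(no_false_blips(["Unknown", "Unknown", "True"]))           # True
```

A real e2e test would feed this check from a watch on the Service's status conditions between steps 3 and 4 above.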
