
Make error-window when waiting for a service to be ready configurable #1023

Closed · Tracked by #1580 · Fixed by #1645
cdlliuy opened this issue Sep 21, 2020 · 17 comments
Labels: good first issue, kind/feature, triage/accepted

Comments


cdlliuy commented Sep 21, 2020

The use case described below would be supported by a new option, as discussed in the comments.


/kind question

We are running with cluster-autoscaler and follow https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler to put some low-priority pause pods in the cluster.
When the worker nodes' CPU/memory are nearly fully requested (about 30% of that is occupied by the pause pods), we create a Knative service with kn and get this error:

kn service create hello --image xxx --wait-timeout 300 --env TARGET=revision1 
Creating service 'hello' in namespace:
  0.380s The Route is still working to reflect the latest desired specification.
  1.253s Configuration "hello" is waiting for a Revision to become ready.
  4.249s Revision "hello-xxx" failed with message: 0/15 nodes are available: 1 Insufficient memory, 14 Insufficient cpu..
  5.077s Configuration "hello" does not have any ready Revision.
Error: RevisionFailed: Revision "hello-xxx" failed with message: 0/15 nodes are available: 1 Insufficient memory, 14 Insufficient cpu..

I checked with the k8s scheduler team: pod scheduling happens in two stages here. The first placement attempt failed and the scheduler preempted the low-priority pods; the second placement attempt then succeeded. As a result, the Knative service reconcile eventually succeeded.

But when using the kn client, the end-user gets this scary failure message. An end-user who doesn't know much about how Kubernetes reconciles resources will be frightened by it.

Another case comes from a race condition in Knative itself, see:
knative/serving#8675
When the kn client threw the error, the ksvc had only existed for 4 seconds. With further reconciles, the ksvc eventually became ready.

So I am wondering whether there is a better way for the watch to reduce these intermittent errors, since reconciliation is a designed behaviour of k8s.
Maybe just add a short extra wait to see whether any condition changes on the next reconcile?


cdlliuy commented Sep 21, 2020

@maximilien :-)


cdlliuy commented Sep 27, 2020

A proposal: let the kn client keep waiting, up to --wait-timeout, if we haven't yet gotten a success message from the ksvc, and print the intermittent error to the end-user, i.e.

kn service create hello --image xxx --wait-timeout 300 --env TARGET=revision1 
Creating service 'hello' in namespace:
  0.380s The Route is still working to reflect the latest desired specification.
  1.253s Configuration "hello" is waiting for a Revision to become ready.
  4.249s Revision "hello-xxx" failed with message: 0/15 nodes are available: 1 Insufficient memory, 14 Insufficient cpu..
  5.077s Configuration "hello" does not have any ready Revision.
Error: RevisionFailed: Revision "hello-xxx" failed with message: 0/15 nodes are available: 1 Insufficient memory, 14 Insufficient cpu..

Error detected during knative service creation.  Continue to watch the latest reconcile progress .....  
.....
.....
.....


rhuss commented Sep 29, 2020

An issue with this kind of waiting for a resource to reach a certain status is deciding when to stop and which intermediate failures to ignore. Since Knative (like Kubernetes) is a declarative, eventually consistent platform, there is no clear point in time at which to stop checking. We already added a 'grace' period to the checks that tolerates some intermediate errors (as long as the status turns green within that period). We can tune that, e.g. increase it or expose it as a parameter to the outside, but then the user still needs to tune that parameter, and it is not clear that a single timeout value will always fix these timing problems.

For such advanced scenarios, when the user knows that certain intermediate error conditions are expected, she can always use the async mode and wait on the proper condition herself (e.g. by looping over kn service describe -o jsonpath ...).
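
A minimal shell sketch of that async-plus-poll approach could look like the following; the --no-wait flag name and the exact jsonpath expression are assumptions based on common kn/kubectl conventions, not something prescribed in this thread:

# Create the service without waiting for readiness; recent kn releases call this
# flag --no-wait (older ones used --async). Check `kn service create --help`.
kn service create hello --image xxx --env TARGET=revision1 --no-wait

# Poll the Ready condition ourselves, tolerating intermediate failures,
# for up to ~5 minutes (60 attempts x 5s).
for i in $(seq 1 60); do
  ready=$(kn service describe hello \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  if [ "$ready" = "True" ]; then
    echo "service 'hello' is ready"
    break
  fi
  sleep 5
done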


rhuss commented Sep 29, 2020

I think reporting the initial failure to start the pod is valid (according to your description); kn has no way of knowing that this failure is expected.

On the other hand, how long does it take to preempt the pods and start the second pod? Is this a matter of seconds or minutes? If it's only, say, 10s, then we can increase the grace period; but if we had to increase this wait time to minutes, kn would never return an error before that grace time is over. Waiting minutes for a negative result is not a good user experience.


cdlliuy commented Sep 29, 2020

@rhuss, preempting the pods and starting the second pod is quite quick in my case, just seconds, but it does depend on the load on the kube-scheduler, since the unschedulable pod is requeued. I expect 10 seconds or so to be enough; it should not take minutes.

See below, 3 seconds in my case.

Warning  FailedScheduling      3m1s   default-scheduler      0/9 nodes are available: 9 Insufficient cpu.
Normal   Scheduled               2m58s  default-scheduler      Successfully assigned <pod> to  <node> ..
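
(For reference, one way to pull those event timestamps from the cluster is shown below; <pod-name> is a placeholder, not taken from this thread.)

# List events for the revision's pod, oldest first; the gap between the
# FailedScheduling and Scheduled timestamps is the preemption delay.
kubectl get events --field-selector involvedObject.name=<pod-name> \
  --sort-by=.lastTimestamp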


cdlliuy commented Oct 13, 2020

@rhuss, could you share your thoughts on this?


rhuss commented Oct 14, 2020

sure.

We just released a fix for that 'grace period' in which intermediate errors are tolerated; it was unfortunately broken in certain situations. If you don't mind, could you please try kn 0.18.1 and check whether it already fixes the issue at hand?

The current 'error window', as we call it, is 2s. The problem with increasing this value is that if there is a real error, it takes that many seconds longer to be reported back to the user. I wouldn't like to increase the default much beyond those 2 seconds. However, we could consider adding an option to tune this parameter, maybe as a configuration option in the configuration file. Would this be helpful for you?
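
(Purely as an illustration of the config-file idea: kn reads defaults from ~/.config/kn/config.yaml, so a tunable window could hypothetically be persisted there. The key name below is invented for this sketch and does not exist in kn.)

# Hypothetical only: persist a larger error window in kn's config file.
# "wait-error-window" is an invented key, shown to illustrate the proposal.
cat >> ~/.config/kn/config.yaml <<'EOF'
wait-error-window: 10s
EOF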


cdlliuy commented Oct 23, 2020

@rhuss, sure, I will give the 0.18.1 release a try and report back to you :-)


cdlliuy commented Dec 15, 2020

@rhuss, it seems that the waiting time here is not enough... Also, more intermittent failures may happen during reconcile. I will get back to this issue with more information.


rhuss commented Dec 15, 2020

Yes, please. It's really difficult to find a reliable approach that works in all circumstances where the service eventually succeeds.


cdlliuy commented Jan 11, 2021

@rhuss, we did observe more intermittent reconcile failures, as described in knative/serving#10511.
You said the current 'error window' is 2s.
Would it be possible to expose that parameter as an option for the end-user?


rhuss commented Jan 14, 2021

Yes, let's do it, as it seems we can't find a single value that works in every context. Let's call the flag --wait-window so that the name correlates with the "wait" functionality.


cdlliuy commented Jan 14, 2021

@rhuss, I'm not sure whether the name --wait-window might be confused with --wait-timeout?
Maybe --wait-window-on-errors? Or --toleration-window? I can't think of better alternatives off the top of my head.


rhuss commented Jan 14, 2021

The concept is hard to explain in one word anyway. It should start with --wait so that it aligns with the other wait option (--wait-timeout), so I would be fine with --wait-window and explaining it in the help message. It's also a balancing act between being precise and being too verbose (which means more typing and is harder to memorize). Also, we already have the concept of a "window" with --autoscale-window (which actually should be named --scale-window like the other autoscale parameters), so I would be fine with --wait-window.
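
(A hedged usage sketch of the flag being discussed; the name follows what this thread converged on, but the exact flag and unit in released kn versions should be verified against the fix referenced in the issue header, #1645.)

# Sketch: tolerate a false/error Ready condition for up to 15 seconds before kn
# reports a failure, while still waiting up to 300 seconds overall for readiness.
kn service create hello --image xxx --env TARGET=revision1 \
  --wait-timeout 300 --wait-window 15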

github-actions bot commented Apr 15, 2021

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions bot added the lifecycle/stale label on Apr 15, 2021

rhuss commented Apr 15, 2021

/remove-lifecycle stale

knative-prow-robot removed the lifecycle/stale label on Apr 15, 2021
rhuss changed the title from "Is that possible to wait for a little bit longer with an error is thrown by knative service CR?" to "Make error-window when waiting for a service to be ready configurable" on Jul 9, 2021

rhuss commented Jul 9, 2021

I renamed the issue to describe what needs to be done (i.e. make our workaround for intermediate failures configurable). This issue is also related to knative/serving#9727

rhuss added the kind/feature, good first issue, and triage/accepted labels on Jul 9, 2021
rhuss moved this to Icebox in Client Planning on Jan 4, 2022
dsimansk added a commit to dsimansk/client that referenced this issue Apr 13, 2022
* [release-v1.1.0] Update kn-plugin-func to v0.23.1

* Update vendor dir
Repository owner moved this from Backlog to Done in Client Planning Apr 13, 2022