
Make error-window when waiting for a service to be ready configurable #1023

Closed · Tracked by #1580 · Fixed by #1645
cdlliuy opened this issue Sep 21, 2020 · 17 comments
Labels: good first issue, kind/feature, triage/accepted

Comments


cdlliuy commented Sep 21, 2020

The use case described below would be supported by a new option, as discussed in the comments.


/kind question

We are running with cluster-autoscaler and follow https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler to put some low-priority pause pods in the cluster.
When the worker nodes' CPU/memory are nearly fully requested (about 30% of that is occupied by the pause pods), we create a Knative service with kn and get this error:

kn service create hello --image xxx --wait-timeout 300 --env TARGET=revision1 
Creating service 'hello' in namespace:
  0.380s The Route is still working to reflect the latest desired specification.
  1.253s Configuration "hello" is waiting for a Revision to become ready.
  4.249s Revision "hello-xxx" failed with message: 0/15 nodes are available: 1 Insufficient memory, 14 Insufficient cpu..
  5.077s Configuration "hello" does not have any ready Revision.
Error: RevisionFailed: Revision "hello-xxx" failed with message: 0/15 nodes are available: 1 Insufficient memory, 14 Insufficient cpu..

I checked with the k8s scheduler team: pod scheduling happens in two stages here. The first placement attempt failed and the scheduler preempted the low-priority pods; the second placement attempt then succeeded. As a result, the Knative service reconcile eventually succeeded.

But when using the kn client, the end-user gets this scary failure message. An end-user who doesn't know much about how Kubernetes reconciles resources will be frightened by it.

Another case comes from a race condition in Knative itself, see:
knative/serving#8675
When the kn client threw the error, the ksvc had only existed for 4 seconds. With further reconciles, the ksvc eventually became ready.

So I am wondering whether there is a better way for the watch to reduce these intermittent errors, since reconciliation is a designed behaviour of k8s.
Maybe just add a short extra wait to see whether any condition changes on the next reconcile?


cdlliuy commented Sep 21, 2020

@maximilien :-)


cdlliuy commented Sep 27, 2020

A proposal: let the kn client keep waiting, up to --wait-timeout, if we haven't yet gotten a success message from the ksvc, and print the intermittent error to the end-user, i.e.

kn service create hello --image xxx --wait-timeout 300 --env TARGET=revision1 
Creating service 'hello' in namespace:
  0.380s The Route is still working to reflect the latest desired specification.
  1.253s Configuration "hello" is waiting for a Revision to become ready.
  4.249s Revision "hello-xxx" failed with message: 0/15 nodes are available: 1 Insufficient memory, 14 Insufficient cpu..
  5.077s Configuration "hello" does not have any ready Revision.
Error: RevisionFailed: Revision "hello-xxx" failed with message: 0/15 nodes are available: 1 Insufficient memory, 14 Insufficient cpu..

Error detected during knative service creation.  Continue to watch the latest reconcile progress .....  
.....
.....
.....


rhuss commented Sep 29, 2020

An issue with this kind of waiting for a resource to reach a certain status is deciding when to stop and which intermediate failures to ignore. Since Knative (like Kubernetes) is a declarative, eventually consistent platform, there is no clear point in time at which to stop checking. We already added a 'grace' period to the checks that tolerates some intermediate errors (as long as the status turns green within that period). We can tune that, e.g. increase it or expose it as a parameter to the outside, but then the user still needs to tune that parameter, and it is not clear that a single timeout value will always fix these timing problems.

For such advanced scenarios, when the user knows that certain intermediate error conditions are expected, she can always use the async mode and wait on the proper condition herself (e.g. by looping over kn service describe -o jsonpath ...).
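
A minimal shell sketch of that async-plus-poll approach could look like the following; the --no-wait flag name and the exact jsonpath expression are assumptions based on common kn/kubectl conventions, not something prescribed in this thread:

# Create the service without waiting for readiness; recent kn releases call this
# flag --no-wait (older ones used --async). Check `kn service create --help`.
kn service create hello --image xxx --env TARGET=revision1 --no-wait

# Poll the Ready condition ourselves, tolerating intermediate failures,
# for up to ~5 minutes (60 attempts x 5s).
for i in $(seq 1 60); do
  ready=$(kn service describe hello \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  if [ "$ready" = "True" ]; then
    echo "service 'hello' is ready"
    break
  fi
  sleep 5
done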


rhuss commented Sep 29, 2020

I think reporting the initial failure to start the pod is valid (according to your description); kn has no way of knowing that this failure is expected.

On the other hand, how long does it take to preempt the pods and start the second pod? Is this a matter of seconds or minutes? If it's only, say, 10s, then we can increase the grace period; but if we had to increase this wait time to minutes, kn would never return an error before that grace time is over. Waiting minutes for a negative result is not a good user experience.


cdlliuy commented Sep 29, 2020

@rhuss, preempting the pods and starting the second pod is quite quick in my case, just seconds, but it does depend on the load on the kube-scheduler, since the unschedulable pod is requeued. I expect 10 seconds or so to be enough; it should not take minutes.

See below, 3 seconds in my case.

Warning  FailedScheduling      3m1s   default-scheduler      0/9 nodes are available: 9 Insufficient cpu.
Normal   Scheduled               2m58s  default-scheduler      Successfully assigned <pod> to  <node> ..
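
(For reference, one way to pull those event timestamps from the cluster is shown below; <pod-name> is a placeholder, not taken from this thread.)

# List events for the revision's pod, oldest first; the gap between the
# FailedScheduling and Scheduled timestamps is the preemption delay.
kubectl get events --field-selector involvedObject.name=<pod-name> \
  --sort-by=.lastTimestamp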


cdlliuy commented Oct 13, 2020

@rhuss, could you share your thoughts on this?


rhuss commented Oct 14, 2020

sure.

We just released a fix for that 'grace period' in which intermediate errors are tolerated; it was unfortunately broken in certain situations. If you don't mind, could you please try kn 0.18.1 and check whether it already fixes the issue at hand?

The current 'error window', as we call it, is 2s. The problem with increasing this value is that if there is a real error, it takes that many seconds longer to be reported back to the user. I wouldn't like to increase the default much beyond those 2 seconds. However, we could consider adding an option to tune this parameter, maybe as a configuration option in the configuration file. Would this be helpful for you?
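
(Purely as an illustration of the config-file idea: kn reads defaults from ~/.config/kn/config.yaml, so a tunable window could hypothetically be persisted there. The key name below is invented for this sketch and does not exist in kn.)

# Hypothetical only: persist a larger error window in kn's config file.
# "wait-error-window" is an invented key, shown to illustrate the proposal.
cat >> ~/.config/kn/config.yaml <<'EOF'
wait-error-window: 10s
EOF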


cdlliuy commented Oct 23, 2020

@rhuss, sure, I will give the 0.18.1 release a try and report back to you :-)


cdlliuy commented Dec 15, 2020

@rhuss, it seems that the waiting time here is not enough... Also, more intermittent failures may happen during reconcile. I will get back to this issue with more information.


rhuss commented Dec 15, 2020

Yes, please. It's really difficult to find a reliable approach that works in all circumstances where the service eventually succeeds.


cdlliuy commented Jan 11, 2021

@rhuss, we did observe more intermittent reconcile failures, as described in knative/serving#10511.
You said the current 'error window' is 2s.
Would it be possible to expose that parameter as an option for the end-user?


rhuss commented Jan 14, 2021

Yes, let's do it, as it seems we can't find a single value that works in every context. Let's call the flag --wait-window so that the name correlates with the "wait" functionality.


cdlliuy commented Jan 14, 2021

@rhuss, I'm not sure whether the name --wait-window might be confused with --wait-timeout?
Maybe --wait-window-on-errors? Or --toleration-window? I can't think of better alternatives off the top of my head.


rhuss commented Jan 14, 2021

The concept is hard to explain in one word anyway. It should start with --wait so that it aligns with the other wait option (--wait-timeout), so I would be fine with --wait-window and explaining it in the help message. It's also a balancing act between being precise and being too verbose (which means more typing and is harder to memorize). Also, we already have the concept of a "window" with --autoscale-window (which actually should be named --scale-window like the other autoscale parameters), so I would be fine with --wait-window.
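
(A hedged usage sketch of the flag being discussed; the name follows what this thread converged on, but the exact flag and unit in released kn versions should be verified against the fix referenced in the issue header, #1645.)

# Sketch: tolerate a false/error Ready condition for up to 15 seconds before kn
# reports a failure, while still waiting up to 300 seconds overall for readiness.
kn service create hello --image xxx --env TARGET=revision1 \
  --wait-timeout 300 --wait-window 15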

github-actions bot commented Apr 15, 2021

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions bot added the lifecycle/stale label on Apr 15, 2021

rhuss commented Apr 15, 2021

/remove-lifecycle stale

knative-prow-robot removed the lifecycle/stale label on Apr 15, 2021
rhuss changed the title from "Is that possible to wait for a little bit longer with an error is thrown by knative service CR?" to "Make error-window when waiting for a service to be ready configurable" on Jul 9, 2021

rhuss commented Jul 9, 2021

I renamed the issue to describe what needs to be done (i.e. make our workaround for intermediate failures configurable). This issue is also related to knative/serving#9727

rhuss added the kind/feature, good first issue, and triage/accepted labels on Jul 9, 2021
rhuss moved this to Icebox in Client Planning on Jan 4, 2022
dsimansk added a commit to dsimansk/client that referenced this issue Apr 13, 2022
* [release-v1.1.0] Update kn-plugin-func to v0.23.1

* Update vendor dir
Repository owner moved this from Backlog to Done in Client Planning Apr 13, 2022