Make error-window when waiting for a service to be ready configurable #1023
@maximilien :-)
A proposal is to let the "kn" client wait as long as the …
An issue with this kind of waiting for a resource to reach a certain status is deciding when to stop and which intermediate failures to ignore. As Knative (like Kubernetes) is a declarative, eventually consistent platform, there is no clear point in time at which to stop the check. We already added a 'grace' period to the checks that tolerates some intermediate errors (as long as the status changes back to green within this grace period). We can tune that, e.g. by increasing it or exposing it as a parameter to the outside (but then the user still has to tune that parameter, and it is not clear that a single timeout value will always fix the timing problems). For advanced scenarios where the user knows that certain intermediate error conditions are expected, she can always use the async mode and wait on the proper condition herself (e.g. by looping over a check of the service's status conditions).
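For illustration, a minimal sketch of that "async + wait yourself" approach, assuming the service is created with `--no-wait` (called `--async` in older kn releases) and that `kn service describe <name> -o json` is available; the service name `hello`, the poll interval and the deadline are placeholders only:

```go
// Sketch only: poll the Ready condition of a Knative service yourself,
// tolerating intermediate failures until your own deadline expires.
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"time"
)

type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Message string `json:"message"`
}

type ksvc struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// readyCondition shells out to `kn service describe` and extracts the Ready condition.
func readyCondition(name string) (condition, error) {
	out, err := exec.Command("kn", "service", "describe", name, "-o", "json").Output()
	if err != nil {
		return condition{}, err
	}
	var svc ksvc
	if err := json.Unmarshal(out, &svc); err != nil {
		return condition{}, err
	}
	for _, c := range svc.Status.Conditions {
		if c.Type == "Ready" {
			return c, nil
		}
	}
	return condition{Type: "Ready", Status: "Unknown"}, nil
}

func main() {
	const name = "hello" // placeholder service name
	deadline := time.Now().Add(5 * time.Minute)
	for time.Now().Before(deadline) {
		c, err := readyCondition(name)
		if err == nil && c.Status == "True" {
			fmt.Println("service is ready")
			return
		}
		// "False" or "Unknown" is treated as "not ready yet" until the deadline,
		// i.e. intermediate errors are tolerated for as long as the user likes.
		fmt.Printf("not ready yet (status=%s): %s\n", c.Status, c.Message)
		time.Sleep(3 * time.Second)
	}
	fmt.Println("timed out waiting for the service to become ready")
}
```

Where intermediate `False` states should simply be ignored altogether, `kubectl wait ksvc/<name> --for=condition=Ready --timeout=5m` should achieve much the same without any code.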
I think the report of an initial failure to start the pod is valid (according to your description). On the other hand, how long does it take to preempt the pods and start the second pod? Is this a matter of seconds or minutes? If it's only, let's say, 10s, then we can increase the grace period; but if we had to increase this wait time to minutes, that would mean kn never returns with an error before this grace time is over. Waiting minutes for a negative result is not a good user experience.
@rhuss, I think preempting the pods and starting the second pod is quite quick in my case, just seconds, but it does depend on the workload of kube-scheduler, since the unschedulable pod is requeued. I expect 10 seconds or more to be enough, but it should not be minutes. See below: 3 seconds in my case.
@rhuss, could you share a bit of your thinking on this?
Sure. We just released a fix for that 'grace period' in which intermediate errors are tolerated; it was unfortunately broken in certain situations. If you don't mind, could you please try out kn 0.18.1 and check whether it already fixes the issue at hand? The current 'error window', as we call it, is 2s. The problem with increasing this value is that, if there is a real error, it takes that many seconds longer to get reported back to the user. I wouldn't like to increase the value much beyond those 2 seconds by default. However, we could consider adding an option to tune this parameter, maybe as an option in the configuration file. Would this be helpful for you?
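To make that trade-off concrete, here is a rough, hypothetical sketch of what such an "error window" does (this is not kn's actual code; `waitForReady`, the `check` callback and all values are illustrative): an error state is only reported if it persists longer than the window, so transient failures are tolerated, while a real failure is delayed by at most the window length before the user sees it.

```go
// Hypothetical sketch of the "error window" idea, not kn's implementation.
package main

import (
	"fmt"
	"time"
)

// waitForReady polls check() until it reports ready, a persistent error, or timeout.
func waitForReady(check func() (ready bool, errMsg string),
	timeout, errorWindow, interval time.Duration) error {

	deadline := time.Now().Add(timeout)
	var errorSince time.Time // zero value means "no error currently observed"

	for time.Now().Before(deadline) {
		ready, errMsg := check()
		switch {
		case ready:
			return nil
		case errMsg == "":
			errorSince = time.Time{} // not ready, but no error either: reset the window
		case errorSince.IsZero():
			errorSince = time.Now() // error just appeared: start the window
		case time.Since(errorSince) > errorWindow:
			// The error persisted for longer than the window: report it.
			return fmt.Errorf("service failed: %s", errMsg)
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("timed out after %v", timeout)
}

func main() {
	// Simulated check: the first two polls see a transient error (placeholder
	// message), then the service becomes ready, roughly the preemption scenario.
	calls := 0
	check := func() (bool, string) {
		calls++
		if calls < 3 {
			return false, "transient scheduling error (placeholder)"
		}
		return true, ""
	}

	// With a 2s window and 1s poll interval the transient error is tolerated;
	// making errorWindow configurable is what this issue asks for.
	if err := waitForReady(check, 2*time.Minute, 2*time.Second, 1*time.Second); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("ready")
}
```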
@rhuss, sure, I will give the 0.18.1 release a try and report back to you :-)
@rhuss, it seems that the waiting time here is not enough ... Also, more intermittent failures may happen during reconciliation. I will get back on this issue with more information.
Yes, please. It's really difficult to find a reliable way that works in all circumstances in the case of eventual success.
@rhuss, we did observe more intermittent reconcile failures, as described in knative/serving#10511.
Yes, let's do it, as it seems that we can't have a value that works for every context. Let's call the flag …
@rhuss, I'm not sure whether the name …
The concept is hard to explain in one word anyway. It should start with …
This issue is stale because it has been open for 90 days with no activity.
/remove-lifecycle stale
I renamed the issue to describe what needs to be done (i.e. make our workaround for intermediate failures configurable). This issue is also related to knative/serving#9727.
The use case described below in this issue would be supported by such a new option.
We are running with cluster-autoscaler and use https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler to place some low-priority pause pods in the cluster.
When the worker nodes' CPU/memory are nearly 99% occupied (30% of that by the pause pods), we create a Knative service with kn and get an error.
I checked with the k8s scheduler team: the pod scheduling happens in two stages. The first placement attempt fails and the scheduler preempts the low-priority pods; then the second placement attempt succeeds. As a result, the Knative service reconciliation finally succeeds.
But when using the kn client, the end user gets the scary "failed" message. If the end user does not know enough about the k8s reconcile behaviour, he/she will be alarmed.
Another case comes from a race condition in Knative itself, see:
knative/serving#8675
When the error is thrown by the kn client, the ksvc has only existed for 4 seconds. Later on, after more reconciliation, the ksvc finally becomes ready.
So I am wondering whether there is a better approach for the watch to reduce these intermittent errors, since reconciliation is a designed behaviour of k8s. Maybe just add a short extra wait to see whether the condition changes on the next reconcile?